# Generate simulated data to link

In this case study, we imagine running PVS on the 2030 Census Unedited File (CUF) -- see the main notebook for more details, including references used throughout this notebook.
This notebook creates input (CUF) and reference files approximating what would be used in such a PVS process.

In [1]:
import pseudopeople as psp
import numpy as np
import pandas as pd

In [2]:
!pip freeze | grep pseudopeople

## Load simulated data

### Noise configuration

To give ourselves more of a challenge, we significantly increase the amount of noise
relative to the pseudopeople defaults.
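
To get a feel for what the two noise parameters mean together, here is a back-of-the-envelope sketch. It assumes, as the comments in the helper function in this section describe, that `cell_probability` is the share of cells selected for noise and `token_probability` is the per-character error rate within a selected cell. The function name, the column size, and the fixed per-character rate of 0.1 are made up for illustration.

```python
def expected_typos(n_cells, avg_length, cell_probability, token_probability):
    """Rough expected number of character-level errors in a column.

    Assumes each cell is selected independently with cell_probability and,
    within a selected cell, each character is corrupted independently with
    token_probability. This is an illustration, not pseudopeople's code.
    """
    noisy_cells = n_cells * cell_probability
    return noisy_cells * avg_length * token_probability

# Hypothetical column: 10,000 last names averaging 7 characters each.
# The per-character rate is held at 0.1 in both cases (an assumption).
default_like = expected_typos(10_000, 7, cell_probability=0.01, token_probability=0.1)
increased = expected_typos(10_000, 7, cell_probability=0.1, token_probability=0.1)
print(default_like, increased)
```

Under these assumptions, moving from 1% to 10% careless cells multiplies the expected number of typos by ten.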

In [3]:
default_configuration = psp.get_config()

In [4]:
# Helper functions for changing the default configuration according to a pattern
def column_noise_value(dataset, column, noise_type, default_value):
    if dataset in ('decennial_census', 'taxes_w2_and_1099', 'social_security'):
        if noise_type == "make_typos":
            if column == "middle_initial":
                # 5% of middle initials (which are all a single token anyway) are wrong.
                return {"cell_probability": 0.05, "token_probability": 1}
            elif column in ("first_name", "last_name", "street_name"):
                # 10% of these text columns were entered carelessly, at a rate of 1 error
                # per 10 characters.
                # The pseudopeople default is 1% careless.
                return {"cell_probability": 0.1, "token_probability": 0.1}
        elif noise_type == "write_wrong_digits":
            # 10% of number columns were written carelessly, at a rate of 1 error
            # per 10 characters.
            # The pseudopeople default is 1% careless.
            # Note that this is applied on top of (the default lower levels of) typos,
            # since typos also apply to numeric characters.
            return {"cell_probability": 0.1, "token_probability": 0.1}

    return default_value

def row_noise_value(dataset, noise_type, default_value):
    return default_value

In [5]:
custom_configuration = {
    dataset: {
        noise_category: (
            ({
                column: {
                    noise_type: column_noise_value(dataset, column, noise_type, noise_type_config)
                    for noise_type, noise_type_config in column_config.items()
                }
                for column, column_config in noise_category_config.items()
            }
            if noise_category == "column_noise" else
            {
                noise_type: row_noise_value(dataset, noise_type, noise_type_config)
                for noise_type, noise_type_config in noise_category_config.items()
            })
        )
        for noise_category, noise_category_config in dataset_config.items()
    }
    for dataset, dataset_config in default_configuration.items()
}
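
The nested comprehension above is compact but hard to read. As a sketch of the same idea, here is an explicit-loop version run on a toy configuration. The toy structure mirrors what the comprehension assumes (dataset, then noise category, then column or noise type), but the specific row-noise names and all values are placeholders, not pseudopeople's real defaults, and `override` and `apply_overrides` are hypothetical helpers.

```python
import copy

# Toy stand-in for the nested structure returned by psp.get_config().
# Names under "row_noise" and all numeric values are illustrative only.
toy_default = {
    "decennial_census": {
        "row_noise": {"some_row_noise_type": {"some_probability": 0.01}},
        "column_noise": {
            "first_name": {"make_typos": {"cell_probability": 0.01, "token_probability": 0.1}},
            "age": {"write_wrong_digits": {"cell_probability": 0.01, "token_probability": 0.1}},
        },
    },
}

def override(dataset, column, noise_type, default_value):
    # Hypothetical, simplified rule: raise typo rates on first_name only.
    if noise_type == "make_typos" and column == "first_name":
        return {"cell_probability": 0.1, "token_probability": 0.1}
    return default_value

def apply_overrides(default_config, column_rule):
    config = copy.deepcopy(default_config)  # leave the defaults untouched
    for dataset, dataset_config in config.items():
        for column, column_config in dataset_config.get("column_noise", {}).items():
            for noise_type, params in column_config.items():
                # Reassigning an existing key while iterating is safe.
                column_config[noise_type] = column_rule(dataset, column, noise_type, params)
    return config

toy_custom = apply_overrides(toy_default, override)
print(toy_custom["decennial_census"]["column_noise"]["first_name"])
```

Anything the rule does not touch, including all row noise, passes through with its default value.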

### Record ID tracking (data lineage)

We do a little bit of work here to enable tracking the "ground truth" (the simulant IDs from
pseudopeople).
We give each pseudopeople record/row a unique identifier for tracking, and then we immediately
separate the ground truth information (which we would not have if we were using real data)
from the rest of the columns (which we would have).
The ground truth is only used in the specific "ground truth" section of this notebook,
to help avoid accidentally leaking information into the case study.

Since we also combine/aggregate pseudopeople records as part of the process of generating the
simulated PVS reference files, ground truth is a bit more complicated than you might imagine.
For example, the ground truth may tell us that a single row in a reference file is actually
a composite of several individuals, because even the deterministic linkage (by SSN) we use
here is not without error.

We handle this by tracking *all* source records used in the construction of each record in
our reference files.
This is achieved by including a `source_record_ids` column in all files; each value is a tuple
of the IDs of every source record that contributed to that row.
When we aggregate records, their tuples are combined accordingly.
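
As a minimal plain-Python sketch of what "combined accordingly" means for a single aggregated row: the tuples from each contributing row are unioned, and a missing side (as can happen in an outer join) contributes nothing. The function name is hypothetical, and the sorting is an addition for readability; the pandas helpers in this notebook do not sort.

```python
def combine_source_record_ids(*id_tuples):
    """Union several source_record_ids tuples into one.

    None stands in for a missing side of an outer join. Sorting is an
    addition for readability; the notebook's own helpers do not sort.
    """
    combined = set()
    for ids in id_tuples:
        combined.update(ids or ())
    return tuple(sorted(combined))

# Hypothetical rows about to be merged into one reference-file row.
name_row = ("ssa_numident_3536", "ssa_numident_16904")
dob_row = ("ssa_numident_16904",)
print(combine_source_record_ids(name_row, dob_row))
print(combine_source_record_ids(name_row, None))
```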

In [6]:
def add_unique_record_id(df, dataset_name):
    df = df.reset_index().rename(columns={'index': 'record_id'})
    df['record_id'] = f'{dataset_name}_' + df.record_id.astype(str)
    return df

# Initializes the source_record_ids column.
# Should only be called on "source records"; that is, records that
# come directly out of pseudopeople.
def record_id_to_single_source_record(df, source_col='record_id'):
    df = df.copy()
    df['source_record_ids'] = df[source_col].apply(lambda x: (x,))
    return df

In [7]:
# Operations that aggregate records, combining the source_record_ids column
# between all records that are aggregated into a single row

def merge_preserving_source_records(dfs, *args, **kwargs):
    dfs = [df.drop(columns=['record_id'], errors='ignore') for df in dfs]
    for df in dfs:
        assert 'source_record_ids' in df.columns

    result = dfs[0]
    for df_to_merge in dfs[1:]:
        result = (
            result.merge(df_to_merge, *args, **kwargs)
                # Get all unique source records that contributed to either record
                # Need to fill nulls here, because an outer join can cause a composite
                # record to be created only from rows in one of the dfs
                .assign(
                    source_record_ids=lambda df: (
                        fillna_empty_tuple(df.source_record_ids_x)
                        + fillna_empty_tuple(df.source_record_ids_y)
                    ).apply(set).apply(tuple)
                )
                .drop(columns=['source_record_ids_x', 'source_record_ids_y'])
        )

    return result

# Weirdly, it is quite hard to fill NaNs in a Series with
# the literal value of an empty tuple.
# See https://stackoverflow.com/a/62689667/
def fillna_empty_tuple(s):
    return s.fillna({i: tuple() for i in s.index})

import itertools

def dedupe_preserving_source_records(df, columns_to_dedupe, source_records_col='source_record_ids'):
    return (
        # NOTE: If not for dropna=False, we would silently lose
        # all rows with a null in any of the columns_to_dedupe
        df.groupby(columns_to_dedupe, dropna=False)
            # Concatenate all the tuples into one big tuple
            # https://stackoverflow.com/a/3021851/
            [source_records_col].apply(lambda x: tuple(set(itertools.chain(*x))))
            .reset_index()
    )
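
To make the deduplication logic concrete without pandas, here is a small sketch of the same grouping on hypothetical rows. It groups on the key columns, keeps rows whose key contains a missing value (the behavior `dropna=False` requests), and flattens each group's tuples with `itertools.chain`. The row layout and IDs are invented for the example, and the result is sorted only to make it deterministic.

```python
import itertools
from collections import defaultdict

# Hypothetical rows: (ssn, first_name, source_record_ids). The second and
# third rows are exact duplicates on the key columns; the last has no name.
rows = [
    ("111-11-1111", "Ryan", ("rec_0",)),
    ("111-11-1111", "Ryam", ("rec_1",)),
    ("111-11-1111", "Ryam", ("rec_2",)),
    ("222-22-2222", None, ("rec_3",)),
]

groups = defaultdict(list)
for ssn, first_name, source_ids in rows:
    # None is a valid dict key, so rows with a missing name are kept
    # rather than silently dropped.
    groups[(ssn, first_name)].append(source_ids)

deduped = {
    key: tuple(sorted(set(itertools.chain.from_iterable(id_tuples))))
    for key, id_tuples in groups.items()
}
for key, source_ids in deduped.items():
    print(key, source_ids)
```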

### SSA Numident

Wagner and Layne, p. 4:

> The reference files are derived from the Social Security Administration
    (SSA) Numerical Identification file (SSA Numident). The Numident contains all
    transactions recorded against one Social Security Number (SSN)...

Based on the [SSA Numident through 2007 which is publicly available from NARA](https://aad.archives.gov/aad/series-description.jsp?s=5057),
we know there are three kinds of transactions: SSN applications, deaths, and claiming benefits.
SSN holders may change their information (e.g. changing name or sex) by submitting another application,
which generates an additional application transaction.
(The policies about this are found [on the SSA website](https://secure.ssa.gov/poms.nsf/lnx/0110212200).)

The paper ["Likely Transgender Individuals in U.S. Federal Administrative Records and the 2010 Census" by Benjamin Cerf Harris](https://www.census.gov/content/dam/Census/library/working-papers/2015/adrm/carra-wp-2015-03.pdf)
includes some helpful statistics (Table 2).
The average person in the SSA Numident has 2.2 transactions (called "claims" in that paper, but with the same definition
as our term "transaction": "Any time an SSN is created or information associated with an existing SSN is changed, that event is registered
as a claim.").

pseudopeople does not currently include correction, name change, or benefits claim transactions.
It only includes SSN creation and death of the SSN holder.

I assume there would be some delay in obtaining the Numident, so that at Census processing time
for the 2030 Census, only SSA transactions through the end of 2029 would be available.
Note that with pseudopeople's current design it is only possible to set this cutoff at the end of a calendar year.
The NORC report says that "the Census NUMIDENT is recreated each year, to reflect
Social Security transaction records through **March** of each year" (p. 105),
though it isn't clear when in the year the Census Numident is actually re-created.

The SSA Numident is supposed to contain a sex column, but it currently doesn't in pseudopeople.

In [8]:
%%time

ssa_numident = psp.generate_social_security(year=2029, config=custom_configuration)
ssa_numident = add_unique_record_id(ssa_numident, 'ssa_numident')
ssa_numident = record_id_to_single_source_record(ssa_numident)
ssa_numident

                                                               

CPU times: user 717 ms, sys: 46.6 ms, total: 764 ms
Wall time: 788 ms




Unnamed: 0,record_id,simulant_id,first_name,middle_initial,last_name,date_of_birth,ssn,event_type,event_date,source_record_ids
0,ssa_numident_0,0_19979,Mary,M,Pierce,12/04/1919,786-77-6454,creation,19191204,"(ssa_numident_0,)"
1,ssa_numident_1,0_6846,Peter,M,Mundell,06/07/1921,688-88-6377,creation,19210607,"(ssa_numident_1,)"
2,ssa_numident_2,0_19941,Anna,H,Causey,03/07/1922,665-25-7858,creation,12220307,"(ssa_numident_2,)"
3,ssa_numident_3,0_19825,Gertrude,M,Osornia,05/11/1922,875-10-2359,creation,19220511,"(ssa_numident_3,)"
4,ssa_numident_4,0_19806,Edna,A,Hunter,05/25/1922,420-19-3737,creation,19220525,"(ssa_numident_4,)"
...,...,...,...,...,...,...,...,...,...,...
20027,ssa_numident_20027,0_23620,Mila,M,Saldana,01/09/2030,133-85-8593,creation,20291218,"(ssa_numident_20027,)"
20028,ssa_numident_20028,0_23629,Luna,N,Bonnell,01/09/2030,422-69-9071,creation,20291218,"(ssa_numident_20028,)"
20029,ssa_numident_20029,0_23630,Charlotte,A,May,01/10/2030,826-03-0946,creation,20291218,"(ssa_numident_20029,)"
20030,ssa_numident_20030,0_23624,Liam,C,Vanover,01/12/2030,778-37-9317,creation,20291218,"(ssa_numident_20030,)"


In [9]:
ssa_numident_ground_truth = ssa_numident.set_index('record_id').simulant_id
ssa_numident = ssa_numident.drop(columns=['simulant_id'])

### W2/1099 tax filings

We assume that the last 10 years of taxes would be available and used in the construction of the reference files --
see section about reference files below.

Note that these are retrieved by *tax* year, so the 2029 taxes would be available in early 2030
(around when our hypothetical case study is taking place).

In [10]:
tax_years = list(range(2020, 2030))
tax_years

[2020, 2021, 2022, 2023, 2024, 2025, 2026, 2027, 2028, 2029]

In [11]:
%%time

# Combine W2/1099 for all years, adding a tax_year column to track which tax year each row came from.
w2_1099 = pd.concat([
    psp.generate_taxes_w2_and_1099(year=year, config=custom_configuration).assign(tax_year=year) for year in tax_years
], ignore_index=True)
w2_1099 = add_unique_record_id(w2_1099, 'w2_1099')
w2_1099 = record_id_to_single_source_record(w2_1099)
w2_1099

                                                               

CPU times: user 10.6 s, sys: 832 ms, total: 11.4 s
Wall time: 9.91 s


Unnamed: 0,record_id,simulant_id,first_name,middle_initial,last_name,age,date_of_birth,mailing_address_street_number,mailing_address_street_name,mailing_address_unit_number,...,employer_name,employer_street_number,employer_street_name,employer_unit_number,employer_city,employer_state,employer_zipcode,tax_form,tax_year,source_record_ids
0,w2_1099_0,0_4,Michael,M,Ticas,37,03/13/1983,1312,commonwealth avnue,,...,Pikes Creek Campground,e,ince dr,,Anytown,US,00000,W2,2020,"(w2_1099_0,)"
1,w2_1099_1,0_5,Michelle,M,Ticas,39,08/10/1981,1312,commonwealth avnue,,...,Warrensburg,,ne 39th ave,,Anytown,US,00000,W2,2020,"(w2_1099_1,)"
2,w2_1099_2,0_5621,Jeffrey,Z,Quintana,50,07/26/1970,,,,...,France,38,mckenzie hwy,,Anytown,US,00000,W2,2020,"(w2_1099_2,)"
3,w2_1099_3,0_5623,Gloria,A,Quintana,47,07/23/1973,,,,...,Aquarium,2916,4th ave w,,Anytown,US,00000,W2,2020,"(w2_1099_3,)"
4,w2_1099_4,0_5623,Gloria,A,Quintana,47,07/23/1973,,,,...,Nashville City Properties,411,sthe 20th avenue,,Anytown,US,00000,1099,2020,"(w2_1099_4,)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105719,w2_1099_105719,0_3456,Amanda,M,Mitchell,49,02/15/1980,3,goodland avnue,,...,New Era Home,222,w hemlock st,,Anytown,US,00000,W2,2029,"(w2_1099_105719,)"
105720,w2_1099_105720,0_3457,Steven,R,Mitchell,49,03/13/1980,3,goodland avnue,,...,France,2506,mccullough lane,,Anytown,US,00000,W2,2029,"(w2_1099_105720,)"
105721,w2_1099_105721,0_19046,Delbert,D,Hawkins,89,03/15/1940,3,goodland avnue,,...,Ram Fashion Nail,20308,hancock str,,Anytown,US,00000,W2,2029,"(w2_1099_105721,)"
105722,w2_1099_105722,0_19046,Delbert,D,Hawkins,89,03/15/1940,3,goodland avnue,,...,A Car Title Loans,6100,e ball rd,,Anytown,US,00000,W2,2029,"(w2_1099_105722,)"


In [12]:
w2_1099_ground_truth = w2_1099.set_index('record_id').simulant_id
w2_1099 = w2_1099.drop(columns=['simulant_id'])

In [13]:
# "IRS records do not contain DOB, and many of the records contain only the first four letters of the last name."
# (Brown et al. 2023, p.30, footnote 19)
# This should be updated in pseudopeople but for now we do it here.
# Note that the name part only matters for ITIN PIKing since for SSNs that are present in SSA we use name from SSA.
w2_1099 = w2_1099.drop(columns=['date_of_birth'])

PROPORTION_OF_IRS_RECORDS_WITH_TRUNCATION = 0.4 # is this a good guess at "many" in the quote above?
idx_to_truncate = w2_1099.sample(frac=PROPORTION_OF_IRS_RECORDS_WITH_TRUNCATION, random_state=1234).index
w2_1099.loc[idx_to_truncate, 'last_name'] = w2_1099.loc[idx_to_truncate, 'last_name'].str[:4]
w2_1099.loc[idx_to_truncate, 'last_name']

90727    Durh
76728    Brew
43336    Grah
40985    Mill
16803    Prze
         ... 
98072    Vill
39159    Thom
93945    Staa
72673    Hoff
65005    Benj
Name: last_name, Length: 42290, dtype: object

In [14]:
# Slightly hacky workaround for a bug in pseudopeople with the type of the PO box column
# We make sure everything is a string and remove the decimal part
# (we know the decimal point will still be there since there is no noise type that currently affects punctuation)
po_box_fixed = w2_1099.mailing_address_po_box.astype(str).str.replace('\\..*$', '', regex=True).replace('nan', np.nan)

# Floats are only present when it is NaN
assert po_box_fixed[po_box_fixed.apply(type) == float].isnull().all()
# The values that are present never contain decimals
assert not po_box_fixed[(po_box_fixed.apply(type) == str)].str.contains('.', regex=False).any()

po_box_fixed.apply(type).value_counts()

mailing_address_po_box
<class 'float'>    101998
<class 'str'>        3726
Name: count, dtype: int64

In [15]:
po_box_fixed[po_box_fixed.notnull()]

2         14011
3         14011
4         14011
94         6846
134       10475
          ...  
105167    10937
105343    10066
105417      984
105419      984
105449    18713
Name: mailing_address_po_box, Length: 3726, dtype: object

In [16]:
w2_1099['mailing_address_po_box'] = po_box_fixed
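
For reference, the pattern used in the PO box workaround should behave like Python's `re.sub` on each string: it removes everything from the first decimal point onward and leaves values without a decimal point alone. A tiny standalone check, with example values chosen to resemble the PO box numbers shown above (the helper name is hypothetical):

```python
import re

def strip_decimal_part(value):
    """Drop everything from the first '.' onward, e.g. a trailing '.0'."""
    return re.sub(r"\..*$", "", value)

examples = ["14011.0", "984.0", "6846"]
cleaned = [strip_decimal_part(v) for v in examples]
print(cleaned)
```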

### 2030 Census Unedited File (CUF)

For now, we gloss over the data schema for addresses.
We don't know how addresses would be formatted in the CUF (and it's hard to guess, because
address is not part of the Census form), but it likely would have some of these fields
(street number, street name, etc) combined.

While PVS input files do not in general have names split into first, middle, and last,
I am guessing the CUF **would** have first name, middle initial, last name (which is how pseudopeople
generates it), because that [matches the Census questionnaire](https://www2.census.gov/programs-surveys/decennial/2020/technical-documentation/questionnaires-and-instructions/questionnaires/2020-informational-questionnaire-english_DI-Q1.pdf).

In [17]:
%%time

census_2030 = psp.generate_decennial_census(year=2030, config=custom_configuration)
census_2030 = add_unique_record_id(census_2030, 'census_2030')
census_2030 = record_id_to_single_source_record(census_2030)
census_2030

                                                               

CPU times: user 632 ms, sys: 26.5 ms, total: 658 ms
Wall time: 620 ms




Unnamed: 0,record_id,simulant_id,first_name,middle_initial,last_name,age,date_of_birth,street_number,street_name,unit_number,city,state,zipcode,relation_to_reference_person,sex,race_ethnicity,source_record_ids
0,census_2030_0,0_923,John,E,Mcueever,86,06/29/1942,147-153,browning ave,,Anytown,US,00000,Reference person,Male,Black,"(census_2030_0,)"
1,census_2030_1,0_2641,Sharon,T,Schmidt,69,10/50/1960,109,stqllion sr,,Anytown,US,00000,Reference person,Female,White,"(census_2030_1,)"
2,census_2030_2,0_6176,Gail,K,Durand,77,01/03/1953,2115,cannon dr,,Anytown,US,00000,Reference person,Female,Multiracial or Other,"(census_2030_2,)"
3,census_2030_3,0_13972,John,J,Williams,81,11/24/1948,146,delaware av,,Anytown,US,00000,Reference person,Male,White,"(census_2030_3,)"
4,census_2030_4,0_13973,Child,L,Wukliamz,81,09/27/1948,146,delaware av,,Anytown,US,00000,Opp-sex spouse,Female,White,"(census_2030_4,)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11048,census_2030_11048,0_22741,Chloe,A,Maryknez-Alvarez,21,07/12/2008,207,harrison st,,Anytown,US,00000,Biological child,Female,Latino,"(census_2030_11048,)"
11049,census_2030_11049,0_22742,Zachary,E,Martinez-Alvarez,18,06/29/2011,207,harrison st,,Anytown,US,00000,Biological child,Male,,"(census_2030_11049,)"
11050,census_2030_11050,0_22743,Madeline,A,Martinez-Alvarez,16,08/12/2013,207,harrison st,,Anytown,US,00000,Biological child,Female,Latino,"(census_2030_11050,)"
11051,census_2030_11051,0_23271,Naomi,A,Martinez-Aldarez,1,11/01/2028,207,harrison st,,Anytown,US,00000,Grandchild,Female,Latino,"(census_2030_11051,)"


In [18]:
census_2030_ground_truth = census_2030.set_index('record_id').simulant_id
census_2030 = census_2030.drop(columns=['simulant_id'])

In [19]:
# As with the PO box column above, the age column has mixed types.
# Convert everything to strings and restore NaN for missing values.
age_fixed = census_2030['age'].astype(str).replace('nan', np.nan)

assert age_fixed[age_fixed.apply(type) == float].isnull().all()
assert not age_fixed[age_fixed.apply(type) == str].str.contains('.', regex=False).any()

age_fixed.apply(type).value_counts()

age
<class 'str'>      10963
<class 'float'>       90
Name: count, dtype: int64

In [20]:
age_fixed[age_fixed.notnull()]

0        86
1        69
2        77
3        81
4        81
         ..
11048    21
11049    18
11050    16
11051     1
11052    47
Name: age, Length: 10963, dtype: object

In [21]:
census_2030['age'] = age_fixed

In [22]:
source_record_ground_truth = pd.concat([
    ssa_numident_ground_truth,
    w2_1099_ground_truth,
    census_2030_ground_truth,
])
source_record_ground_truth

record_id
ssa_numident_0       0_19979
ssa_numident_1        0_6846
ssa_numident_2       0_19941
ssa_numident_3       0_19825
ssa_numident_4       0_19806
                      ...   
census_2030_11048    0_22741
census_2030_11049    0_22742
census_2030_11050    0_22743
census_2030_11051    0_23271
census_2030_11052    0_16724
Name: simulant_id, Length: 136809, dtype: object
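
Once rows in the reference files carry `source_record_ids`, a mapping like this one can tell us which simulants are behind any row, and in particular whether a row is a composite of several individuals. A minimal sketch with a hypothetical ground-truth dictionary (the record IDs and the helper name are invented for the example):

```python
# Hypothetical ground truth: record ID -> simulant ID.
ground_truth = {
    "ssa_numident_0": "0_19979",
    "w2_1099_7": "0_19979",
    "w2_1099_8": "0_4",
}

def simulants_behind(source_record_ids, ground_truth):
    """Return the set of simulants whose records fed one reference-file row."""
    return {ground_truth[record_id] for record_id in source_record_ids}

clean_row = ("ssa_numident_0", "w2_1099_7")
composite_row = ("ssa_numident_0", "w2_1099_8")

print(simulants_behind(clean_row, ground_truth))
# A row is a composite if more than one simulant contributed to it.
print(len(simulants_behind(composite_row, ground_truth)) > 1)
```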

## Create reference files

Wagner and Layne, p. 9:

> The Census Numident – all Social Security Administration (SSA) Numident SSN records are
  edited (collapsed) to produce a Census Numident file that contains “one best-data record” for
  each SSN. All variants of name information for each SSN are retained in the Alternate Name
  Numident file, while all variants of date of birth data are retained in the Alternate DOB
  Numident. The SSN-PIK crosswalk file is used to attach a corresponding unique PIK value for
  each SSN value in the Census Numident file.

### Census Numident

Luque and Wagner, p. 4:
  
> The SSA Numident file contains all transactions ever recorded against any single SSN - with each entry
representing an addition or change (such as name changes) to the SSN record. This file is edited to
create the **Census Numident**, which contains one record for each SSN. Each SSN record in the Census
Numident contains name, DOB, sex, race, place of birth, parents’ name, citizenship status and date of death information.

and in footnote 5:

> Name edits, DOB reconciliation, and race identifiers are some of the edits conducted to produce this Numident
file. **The resulting Numident file contains the most recent name and DOB data.**

We are missing quite a few columns, since they are missing in pseudopeople's SSA Numident: sex, race, place of birth, parents' name,
citizenship status.
However, I'm pretty sure that out of these columns, only sex is used in the PVS linking process.
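
The "most recent name and DOB" rule needs some care when event dates are noisy. Here is a plain-Python sketch of one way to apply it: take the latest validly dated record per SSN, and fall back to an undated record only when nothing better exists. The record layout, SSNs, and function names are hypothetical, and this is an illustration of the idea rather than a description of the real Census Numident edits.

```python
from datetime import datetime

def parse_event_date(text):
    """Parse YYYYMMDD, returning None for anything that is not a valid date."""
    try:
        return datetime.strptime(text, "%Y%m%d")
    except (TypeError, ValueError):
        return None

def latest_per_ssn(records):
    """Keep the record with the latest valid event_date for each SSN.

    Records with an unparseable date sort before everything else, so they
    are chosen only when an SSN has no validly dated record.
    """
    best = {}
    for record in records:
        key = parse_event_date(record["event_date"])
        if key is None:
            key = datetime.min
        ssn = record["ssn"]
        if ssn not in best or key >= best[ssn][0]:
            best[ssn] = (key, record)
    return {ssn: record for ssn, (_key, record) in best.items()}

# Hypothetical transactions; "20291340" is not a valid date.
records = [
    {"ssn": "111-11-1111", "event_date": "20010105", "last_name": "Smith"},
    {"ssn": "111-11-1111", "event_date": "20150620", "last_name": "Jones"},
    {"ssn": "111-11-1111", "event_date": "20291340", "last_name": "Jnes"},
    {"ssn": "222-22-2222", "event_date": "20291340", "last_name": "Lee"},
]
best_names = latest_per_ssn(records)
print({ssn: record["last_name"] for ssn, record in best_names.items()})
```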

In [23]:
def fill_dates(df, fill_with):
    return (
        # Coerce invalid dates to NaT, then replace them with a sentinel date
        pd.to_datetime(df.event_date, format='%Y%m%d', errors='coerce')
            .fillna(pd.Timestamp('2100-01-01' if fill_with == 'latest' else '1900-01-01'))
    )

def best_data_from_columns(df, columns, best_is_latest=True):
    # We don't want to throw out events with a missing/invalid date, so we'll fill them with the value *least* likely to be chosen
    # (earlier than all values if taking the latest, later than all values if taking the earliest).
    fill_with = 'earliest' if best_is_latest else 'latest'

    return (
        df
            # Without mutating the existing date column, get one that is actually
            # a date type and can be used for sorting.
            .assign(event_date_for_sort=lambda df: fill_dates(df, fill_with=fill_with))
            .sort_values('event_date_for_sort')
            .dropna(subset=columns, how='all')
            .drop_duplicates('ssn', keep=('last' if best_is_latest else 'first'))
            [['record_id', 'ssn'] + columns]
            .pipe(record_id_to_single_source_record)
    )

best_name = best_data_from_columns(
    ssa_numident,
    columns=['first_name', 'middle_initial', 'last_name'],
)

best_date_of_birth = best_data_from_columns(
    ssa_numident,
    columns=['date_of_birth'],
)

best_date_of_death = best_data_from_columns(
    ssa_numident[ssa_numident.event_type == 'death'],
    columns=['event_date'],
).rename(columns={'event_date': 'date_of_death'})

census_numident = merge_preserving_source_records(
    [best_name, best_date_of_birth, best_date_of_death],
    on='ssn',
    how='outer',
)
census_numident = add_unique_record_id(census_numident, 'census_numident')
census_numident

Unnamed: 0,record_id,ssn,first_name,middle_initial,last_name,date_of_birth,date_of_death,source_record_ids
0,census_numident_0,757-60-0267,Roderick,A,Ippolito,07/25/1980,,"(ssa_numident_7231,)"
1,census_numident_1,182-19-6926,Angel,C,Strange,11/10/1993,,"(ssa_numident_10000,)"
2,census_numident_2,366-59-0431,Melissa,A,Holmes,08/05/1989,,"(ssa_numident_9093,)"
3,census_numident_3,749-19-8025,Karen,C,Refd,07/26/1948,,"(ssa_numident_1417,)"
4,census_numident_4,636-88-4449,Sonia,A,Wilson,11/18/1970,,"(ssa_numident_5318,)"
...,...,...,...,...,...,...,...,...
18769,census_numident_18769,613-06-2174,Stephanie,C,Sutherland,07/03/2005,,"(ssa_numident_16219,)"
18770,census_numident_18770,723-66-7906,Irene,J,Huber,05/08/1998,,"(ssa_numident_16470,)"
18771,census_numident_18771,091-69-7427,Michael,M,Dinca,08/01/1981,22220809,"(ssa_numident_17131,)"
18772,census_numident_18772,771-23-1422,Stephen,B,Kistner,12/19/2023,,"(ssa_numident_17635,)"


### Alternate Name Numident

Wagner and Layne, p. 9:

>  All variants of name information for each SSN are retained in the Alternate Name
Numident file...

In [24]:
alternate_name_numident = dedupe_preserving_source_records(ssa_numident, ['ssn', 'first_name', 'middle_initial', 'last_name'])
alternate_name_numident = add_unique_record_id(alternate_name_numident, 'alternate_name_numident')
alternate_name_numident

Unnamed: 0,record_id,ssn,first_name,middle_initial,last_name,source_record_ids
0,alternate_name_numident_0,001-02-4588,Isabella,G,Windom,"(ssa_numident_13600,)"
1,alternate_name_numident_1,001-15-8330,Gerald,J,Beckham,"(ssa_numident_6378,)"
2,alternate_name_numident_2,001-16-0077,Jerald,J,Alvarez,"(ssa_numident_5168,)"
3,alternate_name_numident_3,001-17-9511,Teresa,A,Togni,"(ssa_numident_4507,)"
4,alternate_name_numident_4,001-25-8258,Bethany,G,Tenorio,"(ssa_numident_18645,)"
...,...,...,...,...,...,...
19177,alternate_name_numident_19177,976-30-9537,Aron,C,Frausto Ferretiz,"(ssa_numident_6400,)"
19178,alternate_name_numident_19178,978-78-6109,Claude,M,Page,"(ssa_numident_3813,)"
19179,alternate_name_numident_19179,979-44-7835,Thomas,A,Martinez-Puentes,"(ssa_numident_16649,)"
19180,alternate_name_numident_19180,998-22-9577,Jeffery,P,Shaw,"(ssa_numident_11979,)"


In [25]:
alternate_name_numident.groupby('ssn').size().describe()

count    18774.000000
mean         1.021732
std          0.145812
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          2.000000
dtype: float64

In [26]:
alternate_name_numident[alternate_name_numident.ssn.duplicated(keep=False)].sort_values('ssn')

Unnamed: 0,record_id,ssn,first_name,middle_initial,last_name,source_record_ids
152,alternate_name_numident_152,007-77-9704,Ryan,L,Dmith,"(ssa_numident_8773,)"
153,alternate_name_numident_153,007-77-9704,Ryan,L,Smith,"(ssa_numident_16560,)"
194,alternate_name_numident_194,010-37-8961,Bernita,D,Baker,"(ssa_numident_19321,)"
195,alternate_name_numident_195,010-37-8961,Bernota,D,Baker,"(ssa_numident_3840,)"
216,alternate_name_numident_216,011-41-7757,Rkbert,J,Gutierrez,"(ssa_numident_18181,)"
...,...,...,...,...,...,...
19119,alternate_name_numident_19119,898-35-7624,Parker,J,T,"(ssa_numident_17243,)"
19134,alternate_name_numident_19134,899-04-8024,Anthony,M,Davis,"(ssa_numident_16904,)"
19135,alternate_name_numident_19135,899-04-8024,Anthony,M,Dsvis,"(ssa_numident_3536,)"
19137,alternate_name_numident_19137,899-20-3523,Aillie,D,Ortiz,"(ssa_numident_18213,)"


### Alternate DOB Numident

Wagner and Layne, p. 9:

> ... while all variants of date of birth data are retained in the Alternate DOB
Numident.

In [27]:
alternate_dob_numident = dedupe_preserving_source_records(ssa_numident, ['ssn', 'date_of_birth'])
alternate_dob_numident = add_unique_record_id(alternate_dob_numident, 'alternate_dob_numident')
alternate_dob_numident

Unnamed: 0,record_id,ssn,date_of_birth,source_record_ids
0,alternate_dob_numident_0,001-02-4588,08/08/2008,"(ssa_numident_13600,)"
1,alternate_dob_numident_1,001-15-8330,05/04/1976,"(ssa_numident_6378,)"
2,alternate_dob_numident_2,001-16-0077,02/07/1970,"(ssa_numident_5168,)"
3,alternate_dob_numident_3,001-17-9511,11/20/1966,"(ssa_numident_4507,)"
4,alternate_dob_numident_4,001-25-8258,06/29/2026,"(ssa_numident_18645,)"
...,...,...,...,...
18930,alternate_dob_numident_18930,976-30-9537,06/12/1976,"(ssa_numident_6400,)"
18931,alternate_dob_numident_18931,978-78-6109,05/22/1963,"(ssa_numident_3813,)"
18932,alternate_dob_numident_18932,979-44-7835,08/01/1979,"(ssa_numident_16649,)"
18933,alternate_dob_numident_18933,998-22-9577,04/17/2002,"(ssa_numident_11979,)"


In [28]:
alternate_dob_numident.groupby('ssn').size().describe()

count    18774.000000
mean         1.008576
std          0.092210
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          2.000000
dtype: float64

In [29]:
alternate_dob_numident[alternate_dob_numident.ssn.duplicated(keep=False)].sort_values('ssn')

Unnamed: 0,record_id,ssn,date_of_birth,source_record_ids
15,alternate_dob_numident_15,001-90-0159,02/07/1935,"(ssa_numident_17353,)"
16,alternate_dob_numident_16,001-90-0159,02/37/1935,"(ssa_numident_289,)"
60,alternate_dob_numident_60,003-82-6075,09/02/1922,"(ssa_numident_19450,)"
61,alternate_dob_numident_61,003-82-6075,09/02/1972,"(ssa_numident_5689,)"
185,alternate_dob_numident_185,009-98-2667,09/25/1965,"(ssa_numident_4284,)"
...,...,...,...,...
18778,alternate_dob_numident_18778,893-45-2853,,"(ssa_numident_2851,)"
18835,alternate_dob_numident_18835,896-28-9796,10/08/1947,"(ssa_numident_16054,)"
18836,alternate_dob_numident_18836,896-28-9796,50/08/5947,"(ssa_numident_1309,)"
18886,alternate_dob_numident_18886,898-96-4862,03/10/1924,"(ssa_numident_22,)"


### Name/DOB Reference File

Wagner and Layne, p. 9:

> The Name and DOB Reference files are reformatted versions of the Census Numident
and includes **all possible combinations of alternate names and dates of birth, as well as
ITIN data**. All of the reference files contain SSN/ITIN and the corresponding PIK. When
an input record is linked to a reference file, the corresponding PIK is assigned. Table 1
presents the number of observations in each of the reference files.

A slightly confusing point: sometimes the Name and DOB reference files are described
as one and the same thing, and sometimes as separate.
I believe this is because **they differ only in how they are "cut" for the PVS process:**
the name reference file is cut by first and last initial,
while the DOB reference file is cut by month and day of birth.

This is described in Wagner and Layne, p. 15:

> The [DOBSearch] module matches against a re-split
version of the Numident Name Reference file, splitting the data based on month and day
of birth.

Since we handle the logic of "cutting" in the linkage process itself, we generate
a single reference file here.
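
To illustrate what "cutting" means, here is a small sketch that splits the same hypothetical reference records two ways: by first and last initial, and by month and day of birth. The records, the key functions, and the `cut` helper are all invented for the example; the real PVS cuts may differ in details such as how missing or invalid values are handled.

```python
from collections import defaultdict

# Hypothetical reference records; DOB uses the MM/DD/YYYY layout seen above.
reference = [
    {"first_name": "Ryan", "last_name": "Smith", "date_of_birth": "12/28/1987"},
    {"first_name": "Rosa", "last_name": "Soto", "date_of_birth": "03/10/1964"},
    {"first_name": "Carmen", "last_name": "Scalzo", "date_of_birth": "03/10/1964"},
]

def name_cut(record):
    """Key for the name-based cut: first initial plus last initial."""
    return record["first_name"][0].upper() + record["last_name"][0].upper()

def dob_cut(record):
    """Key for the DOB-based cut: month and day of birth."""
    month, day, _year = record["date_of_birth"].split("/")
    return (month, day)

def cut(records, key_function):
    """Split records into buckets that share the same key."""
    buckets = defaultdict(list)
    for record in records:
        buckets[key_function(record)].append(record)
    return dict(buckets)

by_name = cut(reference, name_cut)
by_dob = cut(reference, dob_cut)
print(sorted(by_name), sorted(by_dob))
```

The underlying records are identical in both cuts; only the grouping differs, which is why a single reference file is enough here.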

Note that unlike for addresses, and unlike for the pre-processing of PVS *input* files
(as opposed to reference files), there is no explicit nickname processing/correction here.
I am fairly sure that is accurate to the real PVS, which I believe assumes that nicknames
would not be present in SSA/tax records (or at least, that the real name would appear
at least once in these records).
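
Joining the alternate-name and alternate-DOB files on SSN, as the next cell does, yields every name variant paired with every DOB variant for that SSN. For a single SSN this is just a Cartesian product, sketched here with hypothetical values:

```python
import itertools

# Hypothetical alternate values recorded for one SSN.
alternate_names = [("Anthony", "M", "Davis"), ("Anthony", "M", "Dsvis")]
alternate_dobs = ["12/16/1961", "12/16/1916"]

combinations = [
    {
        "first_name": first,
        "middle_initial": middle,
        "last_name": last,
        "date_of_birth": dob,
    }
    for (first, middle, last), dob in itertools.product(alternate_names, alternate_dobs)
]
print(len(combinations))
```

Two name variants and two DOB variants give four reference rows, so the file grows multiplicatively for SSNs with several variants of each.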

In [30]:
name_dob_numident_records = merge_preserving_source_records(
    [alternate_name_numident, alternate_dob_numident],
    on='ssn',
    how='outer',
)
name_dob_numident_records

Unnamed: 0,ssn,first_name,middle_initial,last_name,date_of_birth,source_record_ids
0,001-02-4588,Isabella,G,Windom,08/08/2008,"(ssa_numident_13600,)"
1,001-15-8330,Gerald,J,Beckham,05/04/1976,"(ssa_numident_6378,)"
2,001-16-0077,Jerald,J,Alvarez,02/07/1970,"(ssa_numident_5168,)"
3,001-17-9511,Teresa,A,Togni,11/20/1966,"(ssa_numident_4507,)"
4,001-25-8258,Bethany,G,Tenorio,06/29/2026,"(ssa_numident_18645,)"
...,...,...,...,...,...,...
19392,976-30-9537,Aron,C,Frausto Ferretiz,06/12/1976,"(ssa_numident_6400,)"
19393,978-78-6109,Claude,M,Page,05/22/1963,"(ssa_numident_3813,)"
19394,979-44-7835,Thomas,A,Martinez-Puentes,08/01/1979,"(ssa_numident_16649,)"
19395,998-22-9577,Jeffery,P,Shaw,04/17/2002,"(ssa_numident_11979,)"


In [31]:
name_dob_numident_records[name_dob_numident_records.ssn.duplicated(keep=False)].sort_values('ssn')

Unnamed: 0,ssn,first_name,middle_initial,last_name,date_of_birth,source_record_ids
15,001-90-0159,Roger,J,Lolmaugh,02/07/1935,"(ssa_numident_17353, ssa_numident_289)"
16,001-90-0159,Roger,J,Lolmaugh,02/37/1935,"(ssa_numident_17353, ssa_numident_289)"
60,003-82-6075,Wendy,C,Colclough,09/02/1922,"(ssa_numident_19450, ssa_numident_5689)"
61,003-82-6075,Wendy,C,Colclough,09/02/1972,"(ssa_numident_19450, ssa_numident_5689)"
154,007-77-9704,Ryan,L,Dmith,12/28/1987,"(ssa_numident_16560, ssa_numident_8773)"
...,...,...,...,...,...,...
19347,898-96-4862,Carmen,R,Scalzo,03/10/1964,"(ssa_numident_17070, ssa_numident_22)"
19349,899-04-8024,Anthony,M,Davis,12/16/1961,"(ssa_numident_16904, ssa_numident_3536)"
19350,899-04-8024,Anthony,M,Dsvis,12/16/1961,"(ssa_numident_3536, ssa_numident_16904)"
19352,899-20-3523,Aillie,D,Ortiz,01/27/1930,"(ssa_numident_18213, ssa_numident_119)"


#### Incorporating people with ITINs

Individual Taxpayer Identification Numbers (ITINs) can be issued to people who are required to file
federal taxes but are not eligible for a Social Security Number.
The most common reason for this is being an undocumented immigrant and therefore not being authorized
to work in the United States.

It used to be impossible to assign PIKs to people without SSNs.
In 2011 the NORC report stated (p. 38, footnote 19):

> NORC understands that the Census Bureau has undertaken an effort to enhance the PVS reference files with IRS
files that include Individual Taxpayer Identification Numbers (ITIN). For those people who are required to file a tax
return but do not have, and may not want an SSN—such as a non-U.S. citizen—the IRS issues the taxpayer an ITIN.
This enhancement to the PVS reference file may help to match more non-U.S citizens.

By 2014 (Wagner and Layne, p. 5):

> One of the key enhancements [made in recent years] increased the coverage of the reference files by
including records for persons with Individual Taxpayer Identification Numbers assigned
by the Internal Revenue Service (ITINs) to [along with?] the SSN-based Numident data. 

I have not found a specific description of how ITIN records are constructed in any of the
publicly available sources.
This may be because it is straightforward, or because the tax data schema is confidential.
I assume that only IRS data is used, since no other data source that I am aware of would
report ITINs.

It is stated that the ITIN records are created directly from tax filings and not
from ITIN applications (Brown et al. p. 29, footnote 16), which is convenient
because the tax filing data is what we can simulate with
pseudopeople:

> The NUMIDENT provides the PII on the SSN-holder from the issuing agency (SSA), and that PII is used in SSN
verification. **For ITINs, the Census Bureau does not have access to the ITIN applications** to the issuing agency (IRS),
so name and DOB verification of ITINs is less reliable.

"Less reliable" is a bit confusing here because, as stated above when generating
the simulated tax data, IRS data should not contain date of birth at all.
Here, we have stayed true to this by omitting date of birth from the ITIN records entirely.

**IMPORTANT LIMITATION:** In general, we do not expect ITINs to show up on W-2/1099 forms, only
on 1040s. In fact, in pseudopeople data ITINs never show up on W-2/1099s, which means that
all "ITINs" found here are actually spurious artifacts of noise.
This logic should be able to handle 1040s, however, when we have those.
We may decide to *only* use 1040s for this purpose.
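
As a minimal sketch of the heuristic used in the next cell, the toy example below treats any non-null identifier that starts with `9` as an ITIN. The data and the names `toy_tax` and `toy_itins` are invented for illustration; the point is only that missing identifiers have to be excluded, and that any noisy SSN landing in the `9` range would be picked up too.

```python
import pandas as pd

# Toy stand-in for the tax data; every value here is made up.
toy_tax = pd.DataFrame({
    "ssn": ["123-45-6789", "912-70-1234", None, "987-65-4321"],
    "first_name": ["Ann", "Bo", "Cy", "Di"],
})

# Leading-'9' heuristic. na=False makes missing identifiers count as "not an ITIN"
# instead of propagating a missing value into the boolean mask.
looks_like_itin = toy_tax.ssn.str.startswith("9", na=False)
toy_itins = toy_tax[looks_like_itin]
print(toy_itins)
```

This check says nothing about whether the value was issued as an ITIN; it only looks at the first character.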

In [32]:
# Analogous to the process of getting alternate names and dates of birth
# from SSA, we retain all versions of the name from taxes.
name_for_itins = dedupe_preserving_source_records(
    w2_1099[w2_1099.ssn.notnull() & w2_1099.ssn.str.startswith('9')],
    ['ssn', 'first_name', 'middle_initial', 'last_name'],
)
name_for_itins

Unnamed: 0,ssn,first_name,middle_initial,last_name,source_record_ids
0,900-42-6385,Nathan,J,Coffman,"(w2_1099_95770,)"
1,900-50-5059,Fallon,K,Beaudoin,"(w2_1099_24748,)"
2,901-11-4203,Bernice,A,Sant,"(w2_1099_91633,)"
3,901-15-1375,Peter,W,Dodson,"(w2_1099_16976,)"
4,901-39-0254,Brendan,M,Mcna,"(w2_1099_31139,)"
...,...,...,...,...,...
130,996-41-6227,Dorothy,J,Faria,"(w2_1099_20702,)"
131,997-63-6760,Sofia,A,Garc,"(w2_1099_80583,)"
132,997-78-4873,Penny,L,Mccauley,"(w2_1099_10449,)"
133,997-91-2268,Jeremy,,Mast,"(w2_1099_42928,)"
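
The helper `dedupe_preserving_source_records` is defined earlier in this notebook. For readers looking at this section in isolation, here is a rough sketch of what such a helper could look like, under the assumption that it keeps one row per distinct combination of the given columns and concatenates the `source_record_ids` tuples of the rows it collapses. The function name `dedupe_with_sources_sketch` is hypothetical, and the real helper may differ (for example in the ordering of IDs).

```python
import pandas as pd

def dedupe_with_sources_sketch(df, columns):
    """Hypothetical stand-in: one row per distinct value of `columns`,
    with the source record IDs of all collapsed rows combined."""
    grouped = df.groupby(columns, dropna=False)["source_record_ids"]
    # Flatten the tuples in each group, drop repeats, and sort for a stable result
    merged = grouped.agg(lambda ids: tuple(sorted({r for t in ids for r in t})))
    return merged.reset_index()

toy = pd.DataFrame({
    "ssn": ["900-00-0001", "900-00-0001", "900-00-0002"],
    "first_name": ["Ann", "Ann", "Bo"],
    "source_record_ids": [("w2_1",), ("w2_2",), ("w2_3",)],
})
deduped = dedupe_with_sources_sketch(toy, ["ssn", "first_name"])
print(deduped)
```

`dropna=False` matters here: without it, rows with a missing value in one of the key columns (such as a missing middle initial) would silently disappear from the result.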


In [33]:
name_for_itins.groupby('ssn').size().describe()

count    133.000000
mean       1.015038
std        0.122162
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        2.000000
dtype: float64

In [34]:
# Make sure these are disjoint sets of SSN values -- an overlap could only happen
# if an SSN value in the Numident were corrupted into the ITIN range
assert set(name_dob_numident_records.ssn) & set(name_for_itins.ssn) == set()

In [35]:
name_dob_reference_file = pd.concat([
    name_dob_numident_records,
    name_for_itins,
], ignore_index=True)
name_dob_reference_file = add_unique_record_id(name_dob_reference_file, 'name_dob_reference_file')
name_dob_reference_file

Unnamed: 0,record_id,ssn,first_name,middle_initial,last_name,date_of_birth,source_record_ids
0,name_dob_reference_file_0,001-02-4588,Isabella,G,Windom,08/08/2008,"(ssa_numident_13600,)"
1,name_dob_reference_file_1,001-15-8330,Gerald,J,Beckham,05/04/1976,"(ssa_numident_6378,)"
2,name_dob_reference_file_2,001-16-0077,Jerald,J,Alvarez,02/07/1970,"(ssa_numident_5168,)"
3,name_dob_reference_file_3,001-17-9511,Teresa,A,Togni,11/20/1966,"(ssa_numident_4507,)"
4,name_dob_reference_file_4,001-25-8258,Bethany,G,Tenorio,06/29/2026,"(ssa_numident_18645,)"
...,...,...,...,...,...,...,...
19527,name_dob_reference_file_19527,996-41-6227,Dorothy,J,Faria,,"(w2_1099_20702,)"
19528,name_dob_reference_file_19528,997-63-6760,Sofia,A,Garc,,"(w2_1099_80583,)"
19529,name_dob_reference_file_19529,997-78-4873,Penny,L,Mccauley,,"(w2_1099_10449,)"
19530,name_dob_reference_file_19530,997-91-2268,Jeremy,,Mast,,"(w2_1099_42928,)"


### GeoBase Reference File

Wagner and Layne, p. 9:

> PVS creates three other sets of reference
files containing Numident data: the **GeoBase Reference File**, the Name Reference File,
and the DOB Reference file.
The GeoBase Reference File appends addresses from administrative records attached
to Numident data, including all possible combinations of alternate names and dates of
birth for SSN. Addresses from administrative records are edited and processed through
commercial software product to clean and standardize address data. ITIN data is also
incorporated into the Geobase.

Luque and Wagner, p. 5:

> Reference files contain data from the Numident file enhanced with address
data obtained from federal AR [administrative records] files.<sup>8</sup>
The reference files, thus, contain all variants of a person’s name, DOB,
and sex, as well as current and recent addresses. These reference files are
referred to as the (PVS) Geobase reference file since addresses (a geographic component)
are appended to each person record.<sup>9</sup> It is important to note that there are
multiple Geobase reference files that are created depending on the vintage of the
incoming file to be processed through PVS.

> <sup>8</sup> Namely, data from the IRS, Department of Housing and Urban Development,
several files from the Department of Health and Human Services, and Selective Service.

> <sup>9</sup> In particular, the address data is cleaned and standardized and used
to construct a variable called GEOKEY. The GEOKEY variable is constructed as a subset
of the full address, and then is appended to the Numident data to create the
PVS Geobase Reference file.

We only have IRS data to use for addresses, and specifically only W-2/1099 data,
which is a limitation of this case study.
I can't find a concrete definition of "recent" -- as noted above, we use 10 years
of IRS data.
I suspect this is longer than the true window, but this may end up making up for
the lack of multiple data sources, and get us closer to a realistic number of
alternate addresses.

Also, our address data comes out of pseudopeople already parsed into address parts
like street name, etc.
For more realism, pseudopeople should output a single string that we have to (imperfectly) parse apart.

I haven't been able to find out more about what kind of "subset" the geokey is.
It is unclear to me why geokey is "interesting" since it is just derived from the
address parts.

In [36]:
address_cols = list(w2_1099.filter(like='mailing_address').columns)

def standardize_address_part(column):
    return (
        column
            # Remove leading or trailing whitespace
            .str.strip()
            # Turn any strings of consecutive whitespace into a single space
            .str.split().str.join(' ')
            # Normalize case
            .str.upper()
            # Normalize the word street as described in the example quoted above
            # In reality, there would be many rules like this
            # (raw string, so that \b is a regex word boundary rather than a backspace character)
            .str.replace(r'\b(STREET|STR)\b', 'ST', regex=True)
            # Make sure missingness is represented consistently
            .replace('', np.nan)
    )

addresses_by_ssn = dedupe_preserving_source_records(
    w2_1099.set_index(['ssn', 'source_record_ids'])
        [address_cols]
        .apply(standardize_address_part)
        .reset_index(),
    ['ssn'] + address_cols,
)
addresses_by_ssn

Unnamed: 0,ssn,mailing_address_street_number,mailing_address_street_name,mailing_address_unit_number,mailing_address_po_box,mailing_address_city,mailing_address_state,mailing_address_zipcode,source_record_ids
0,000-74-9102,222,WHITE ROAD,,,ANYTOWN,US,00000,"(w2_1099_58948,)"
1,000-87-0907,2732,BINDL DR,,,ANYTOWN,US,00000,"(w2_1099_80962,)"
2,001-02-4588,685,EMERSON ST,,,ANYTOWN,US,00000,"(w2_1099_101199, w2_1099_101200)"
3,001-15-8330,5010,SOUTH DOCTOR MARTIN LUTHER KING JR DR,,,ANYTOWN,US,00000,"(w2_1099_90573, w2_1099_26682, w2_1099_16464, ..."
4,001-15-8330,5010,SOUTH DOCTOR MARTIN LUTHER KING JR DR,,,ANYTOWN,US,00001,"(w2_1099_16465,)"
...,...,...,...,...,...,...,...,...,...
36658,,,,,788,ANYTOWN,US,00000,"(w2_1099_66811,)"
36659,,,,,9138,ANYTOWN,US,00000,"(w2_1099_23450,)"
36660,,,,,9709,ANYTOWN,US,00000,"(w2_1099_43773,)"
36661,,,,,9859,ANYTOWN,US,00000,"(w2_1099_49373,)"
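
To illustrate how more standardization rules could be layered on, here is a small sketch with a table of suffix rules applied in sequence. The `SUFFIX_RULES` mapping and the `standardize_sketch` name are assumptions made up for this example; only the `STREET`/`STR` rule corresponds to what the cell above does. The patterns are raw strings so that `\b` is read as a regex word boundary.

```python
import numpy as np
import pandas as pd

# Hypothetical abbreviation rules; a real standardizer would have many more.
SUFFIX_RULES = {
    r"\b(STREET|STR)\b": "ST",
    r"\bAVENUE\b": "AVE",
    r"\bROAD\b": "RD",
}

def standardize_sketch(column):
    # Trim, collapse internal whitespace, and normalize case
    cleaned = column.str.strip().str.split().str.join(" ").str.upper()
    # Apply each whole-word replacement in turn
    for pattern, replacement in SUFFIX_RULES.items():
        cleaned = cleaned.str.replace(pattern, replacement, regex=True)
    # Represent empty strings as missing
    return cleaned.replace("", np.nan)

toy = pd.Series(["  main   street ", "Oak Str", "strand road", ""])
result = standardize_sketch(toy).tolist()
print(result)
```

The word boundaries keep `STRAND` intact: only a standalone `STR` or `STREET` is rewritten.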


In [37]:
num_addresses = addresses_by_ssn.groupby('ssn').size().sort_values()
num_addresses

ssn
003-21-2342     1
977-12-4779     1
977-56-9243     1
978-59-7386     1
980-18-7657     1
               ..
349-19-8524     8
134-99-1747     8
062-98-9410     9
211-81-8979     9
049-35-7060    10
Length: 20779, dtype: int64

In [38]:
# Show some SSNs with a lot of address variation
addresses_by_ssn[addresses_by_ssn.ssn.isin(num_addresses.tail(10).index)].sort_values('ssn')

Unnamed: 0,ssn,mailing_address_street_number,mailing_address_street_name,mailing_address_unit_number,mailing_address_po_box,mailing_address_city,mailing_address_state,mailing_address_zipcode,source_record_ids
1897,049-35-7060,186,RT 34,,,ANYTOWN,US,00000,"(w2_1099_34061, w2_1099_34063, w2_1099_23639, ..."
1898,049-35-7060,186,RT 34,,,ANYTOWN,,00000,"(w2_1099_23637,)"
1899,049-35-7060,286,RT 34,,,ANYTOWN,US,00000,"(w2_1099_34062,)"
1900,049-35-7060,34212,PROVINE AVE,,,ANYTOWN,US,00000,"(w2_1099_11788, w2_1099_11787)"
1901,049-35-7060,39412,PROVINE AVE,,,ANYTOWN,US,00000,"(w2_1099_1712,)"
...,...,...,...,...,...,...,...,...,...
33836,852-15-3032,3632,MAPLE GROVE LN,,,ANYTOWN,US,00000,"(w2_1099_11817,)"
33837,852-15-3032,5570,NE 132ND ST,,,ANYTOWN,US,00000,"(w2_1099_21118,)"
33838,852-15-3032,5970,NE 132ND ST,,,ANYTOWN,US,00000,"(w2_1099_31443, w2_1099_52669, w2_1099_31444)"
33839,852-15-3032,5993,NE 132ND ST,,,,US,00000,"(w2_1099_41974,)"


In [39]:
# Rough estimate of how many rows we should have in our reference file, once we do this Cartesian product
(
    len(name_dob_reference_file) *
    addresses_by_ssn.groupby('ssn').size().mean()
)

33693.85148467203

In [40]:
geobase_reference_file = merge_preserving_source_records(
    [name_dob_reference_file, addresses_by_ssn],
    on='ssn',
    how='left',
)
geobase_reference_file = add_unique_record_id(geobase_reference_file, 'geobase_reference_file')
geobase_reference_file

Unnamed: 0,record_id,ssn,first_name,middle_initial,last_name,date_of_birth,mailing_address_street_number,mailing_address_street_name,mailing_address_unit_number,mailing_address_po_box,mailing_address_city,mailing_address_state,mailing_address_zipcode,source_record_ids
0,geobase_reference_file_0,001-02-4588,Isabella,G,Windom,08/08/2008,685,EMERSON ST,,,ANYTOWN,US,00000,"(w2_1099_101199, w2_1099_101200, ssa_numident_..."
1,geobase_reference_file_1,001-15-8330,Gerald,J,Beckham,05/04/1976,5010,SOUTH DOCTOR MARTIN LUTHER KING JR DR,,,ANYTOWN,US,00000,"(w2_1099_90573, w2_1099_26682, w2_1099_16464, ..."
2,geobase_reference_file_2,001-15-8330,Gerald,J,Beckham,05/04/1976,5010,SOUTH DOCTOR MARTIN LUTHER KING JR DR,,,ANYTOWN,US,00001,"(w2_1099_16465, ssa_numident_6378)"
3,geobase_reference_file_3,001-15-8330,Gerald,J,Beckham,05/04/1976,5010,SOUTH DOCTOR NARTIN LURHER KING JR DR,,,ANYTOWN,US,00000,"(w2_1099_58258, ssa_numident_6378)"
4,geobase_reference_file_4,001-16-0077,Jerald,J,Alvarez,02/07/1970,,,,,,,,"(ssa_numident_5168,)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34207,geobase_reference_file_34207,996-41-6227,Dorothy,J,Faria,,1702,MEISNER RD,,,ANYTOWN,US,00000,"(w2_1099_20702,)"
34208,geobase_reference_file_34208,997-63-6760,Sofia,A,Garc,,10900,S K ST,,,ANYTOWN,US,00000,"(w2_1099_80583,)"
34209,geobase_reference_file_34209,997-78-4873,Penny,L,Mccauley,,8370,CHERVIL CT,,,ANYTOWN,US,00000,"(w2_1099_10449,)"
34210,geobase_reference_file_34210,997-91-2268,Jeremy,,Mast,,3232,MAPLE GROVE LN,,,ANYTOWN,US,00000,"(w2_1099_42928,)"
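
The row count above follows from how a left merge on `ssn` behaves: every name/DOB variant is paired with every address variant for the same SSN, and variants with no address at all are kept with missing address fields. The toy example below shows this with plain `DataFrame.merge`; it is only a sketch and leaves out the `source_record_ids` bookkeeping that `merge_preserving_source_records` presumably performs.

```python
import pandas as pd

# Two name variants for SSN "001", one for "002" (toy values)
names = pd.DataFrame({
    "ssn": ["001", "001", "002"],
    "last_name": ["Davis", "Dsvis", "Alvarez"],
})
# Two address variants for "001", none for "002"
addresses = pd.DataFrame({
    "ssn": ["001", "001"],
    "street": ["5 MAIN ST", "7 OAK AVE"],
})

combined = names.merge(addresses, on="ssn", how="left")
print(combined)
```

SSN `001` contributes 2 x 2 = 4 rows, and SSN `002` contributes one row with a missing `street`, for 5 rows in total.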


## Track ground truth for reference files

In [41]:
def get_simulants_for_record_ids(record_ids, ground_truth=source_record_ground_truth):
    return tuple(ground_truth.loc[record_id] for record_id in record_ids)

def get_simulants_of_source_records(df, filter_record_ids=None):
    source_record_ids = df.set_index('record_id').source_record_ids
    if filter_record_ids is not None:
        source_record_ids = source_record_ids.apply(lambda r_ids: [r_id for r_id in r_ids if filter_record_ids(r_id)])
    return source_record_ids.apply(get_simulants_for_record_ids).rename('simulant_ids')

# Working with a tuple column is a bit of a pain -- these helpers are for
# operations on this column.
# Oddly, transforming each tuple into a pandas Series (e.g. .apply(pd.Series))
# and using the pandas equivalents of these functions seems to be orders of magnitude slower.

from statistics import multimode

def nunique(data_tuple):
    return len(set(data_tuple))

def mode(data_tuple):
    if len(data_tuple) == 0:
        return None
    return multimode(data_tuple)[0]
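
One property of this `mode` helper is worth spelling out: `statistics.multimode` returns tied modes in the order they were first encountered, so taking element `[0]` breaks ties in favor of whichever value appears earliest in the tuple. The sketch below repeats the helper under the name `mode_sketch` so that it can be run on its own.

```python
from statistics import multimode

def mode_sketch(data_tuple):
    # Same logic as the mode helper above, repeated so this block is self-contained
    if len(data_tuple) == 0:
        return None
    return multimode(data_tuple)[0]

clear_winner = mode_sketch(("0_1", "0_2", "0_2"))  # "0_2" occurs most often
tie = mode_sketch(("0_1", "0_2"))                  # tie: first-seen value wins
empty = mode_sketch(())                            # nothing to choose from

print(clear_winner, tie, empty)
```

Because ties depend on tuple order, the result for a tied record depends on the order in which source record IDs were accumulated.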

### Census Numident

In [42]:
census_numident_simulants = get_simulants_of_source_records(census_numident)
census_numident_simulants

record_id
census_numident_0         (0_3679,)
census_numident_1         (0_8793,)
census_numident_2        (0_15095,)
census_numident_3          (0_978,)
census_numident_4         (0_2684,)
                            ...    
census_numident_18769    (0_20398,)
census_numident_18770    (0_20622,)
census_numident_18771      (0_136,)
census_numident_18772    (0_21633,)
census_numident_18773    (0_21772,)
Name: simulant_ids, Length: 18774, dtype: object

In [43]:
census_numident_simulants.apply(nunique).describe()

count    18774.0
mean         1.0
std          0.0
min          1.0
25%          1.0
50%          1.0
75%          1.0
max          1.0
Name: simulant_ids, dtype: float64

In [44]:
census_numident_ground_truth = census_numident_simulants.apply(mode).rename('simulant_id')
census_numident_ground_truth

record_id
census_numident_0         0_3679
census_numident_1         0_8793
census_numident_2        0_15095
census_numident_3          0_978
census_numident_4         0_2684
                          ...   
census_numident_18769    0_20398
census_numident_18770    0_20622
census_numident_18771      0_136
census_numident_18772    0_21633
census_numident_18773    0_21772
Name: simulant_id, Length: 18774, dtype: object

### Alternate Name Numident

In [45]:
source_record_simulants = get_simulants_of_source_records(alternate_name_numident)

In [46]:
source_record_simulants.apply(nunique).describe()

count    19182.0
mean         1.0
std          0.0
min          1.0
25%          1.0
50%          1.0
75%          1.0
max          1.0
Name: simulant_ids, dtype: float64

In [47]:
# We take the most common ground truth value.
# Again, as shown above, there are no SSN collisions.
alternate_name_numident_ground_truth = source_record_simulants.apply(mode).rename('simulant_id')
alternate_name_numident_ground_truth

record_id
alternate_name_numident_0        0_13602
alternate_name_numident_1        0_16514
alternate_name_numident_2        0_13906
alternate_name_numident_3        0_13442
alternate_name_numident_4        0_22495
                                  ...   
alternate_name_numident_19177     0_4258
alternate_name_numident_19178    0_19947
alternate_name_numident_19179    0_20792
alternate_name_numident_19180     0_9017
alternate_name_numident_19181    0_12964
Name: simulant_id, Length: 19182, dtype: object

### Alternate DOB Numident

In [48]:
source_record_simulants = get_simulants_of_source_records(alternate_dob_numident)

In [49]:
source_record_simulants.apply(nunique).describe()

count    18935.0
mean         1.0
std          0.0
min          1.0
25%          1.0
50%          1.0
75%          1.0
max          1.0
Name: simulant_ids, dtype: float64

In [50]:
alternate_dob_numident_ground_truth = source_record_simulants.apply(mode).rename('simulant_id')
alternate_dob_numident_ground_truth

record_id
alternate_dob_numident_0        0_13602
alternate_dob_numident_1        0_16514
alternate_dob_numident_2        0_13906
alternate_dob_numident_3        0_13442
alternate_dob_numident_4        0_22495
                                 ...   
alternate_dob_numident_18930     0_4258
alternate_dob_numident_18931    0_19947
alternate_dob_numident_18932    0_20792
alternate_dob_numident_18933     0_9017
alternate_dob_numident_18934    0_12964
Name: simulant_id, Length: 18935, dtype: object

### Name/DOB Reference File

In [51]:
source_record_simulants = get_simulants_of_source_records(name_dob_reference_file)

In [52]:
source_record_simulants.apply(nunique).describe()

count    19532.0
mean         1.0
std          0.0
min          1.0
25%          1.0
50%          1.0
75%          1.0
max          1.0
Name: simulant_ids, dtype: float64

In [53]:
name_dob_reference_file_ground_truth = source_record_simulants.apply(mode).rename('simulant_id')
name_dob_reference_file_ground_truth

record_id
name_dob_reference_file_0        0_13602
name_dob_reference_file_1        0_16514
name_dob_reference_file_2        0_13906
name_dob_reference_file_3        0_13442
name_dob_reference_file_4        0_22495
                                  ...   
name_dob_reference_file_19527     0_3634
name_dob_reference_file_19528    0_13204
name_dob_reference_file_19529    0_19917
name_dob_reference_file_19530     0_3731
name_dob_reference_file_19531      0_670
Name: simulant_id, Length: 19532, dtype: object

### GeoBase Reference File

In [54]:
source_record_simulants = get_simulants_of_source_records(geobase_reference_file)

In [55]:
# Now there are some collisions, due to "borrowed SSN" noise
source_record_simulants.apply(nunique).describe()

count    34212.000000
mean         1.048346
std          0.229890
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          5.000000
Name: simulant_ids, dtype: float64

In [56]:
# The most collisions on one SSN
most_collisions = (
    geobase_reference_file.set_index('record_id')
        .loc[source_record_simulants.apply(nunique).sort_values().tail(1).index]
        .reset_index()
        .iloc[0]
)
most_collisions

record_id                                             geobase_reference_file_12857
ssn                                                                    337-34-8000
first_name                                                                   James
middle_initial                                                                   C
last_name                                                            Stewart Ochoa
date_of_birth                                                           08/18/1965
mailing_address_street_number                                                    5
mailing_address_street_name                                           SW 115TH TER
mailing_address_unit_number                                                    NaN
mailing_address_po_box                                                         NaN
mailing_address_city                                                       ANYTOWN
mailing_address_state                                                           US
mail

In [57]:
# Individual tax filings causing those collisions
w2_1099[w2_1099.record_id.isin(most_collisions.source_record_ids)]

Unnamed: 0,record_id,first_name,middle_initial,last_name,age,mailing_address_street_number,mailing_address_street_name,mailing_address_unit_number,mailing_address_po_box,mailing_address_city,...,employer_name,employer_street_number,employer_street_name,employer_unit_number,employer_city,employer_state,employer_zipcode,tax_form,tax_year,source_record_ids
50793,w2_1099_50793,James,C,Stewart Ochoa,59,5,sw 115th ter,,,Anytown,...,Heart School of Jesus Cristo Promesa Behaviora...,3605,johnsontown road,,Anytown,US,0,W2,2024,"(w2_1099_50793,)"
50794,w2_1099_50794,Matt,C,Pineda,52,5,sw 115th ter,,,Anytown,...,La Conasupo Market North Branch Office Depot,242,cushing,apt 3r,Anytown,US,0,W2,2024,"(w2_1099_50794,)"
50795,w2_1099_50795,Matt,C,Pine,52,5,sw 115th ter,,,Anytown,...,400,2200,montrose ave,,Anytown,US,0,W2,2024,"(w2_1099_50795,)"
61481,w2_1099_61481,James,C,Stew,60,5,sw 115th ter,,,Anytown,...,Heart School of Jesus Cristo Promesa Behaviora...,9988,dexter pinckney,,Anytown,US,0,W2,2025,"(w2_1099_61481,)"
72085,w2_1099_72085,James,C,Stewart Ochoa,61,5,sw 115th ter,,,Anytown,...,Heart School of Jesus Cristo Promesa Behaviora...,9988,dexter pinckney,,Anytown,US,0,W2,2026,"(w2_1099_72085,)"
72086,w2_1099_72086,Matt,C,Pine,54,5,sw 115th ter,,,Anytown,...,La Conasupo Market North Branch Office Depot,242,cushing,apt 3r,Anytown,US,0,W2,2026,"(w2_1099_72086,)"
72087,w2_1099_72087,Andrew,M,Rami,44,5,sw 115th ter,,,Anytown,...,North American Galley,43,clubhouse drive,,Anytown,US,0,W2,2026,"(w2_1099_72087,)"
82929,w2_1099_82929,Daniel,R,Neri,66,5,sw 115th ter,,,Anytown,...,A Car Title Loans,6100,e ball rd,,Anytown,US,0,W2,2027,"(w2_1099_82929,)"
82930,w2_1099_82930,James,C,Stew,62,5,sw 115th ter,,,Anytown,...,SGI USA,65,s west ave,,Anytown,US,0,W2,2027,"(w2_1099_82930,)"
82932,w2_1099_82932,Matt,C,Pine,55,5,sw 115th ter,,,Anytown,...,Gershenzon DDS MHS,14,state hwy 200,,Anytown,US,0,W2,2027,"(w2_1099_82932,)"


In [58]:
# Correct value: who actually has the SSN
ssa_numident[ssa_numident.ssn == most_collisions.ssn]

Unnamed: 0,record_id,first_name,middle_initial,last_name,date_of_birth,ssn,event_type,event_date,source_record_ids
4265,ssa_numident_4265,James,C,Stewart Ochoa,08/18/1965,337-34-8000,creation,19650818,"(ssa_numident_4265,)"


In [59]:
ssa_numident_ground_truth.loc[ssa_numident[ssa_numident.ssn == most_collisions.ssn].record_id]

record_id
ssa_numident_4265    0_3938
Name: simulant_id, dtype: object

In [60]:
# Here the "mode" approach happens to return the correct simulant (compare the SSA record above),
# but it is not reliable in general: the people borrowing an SSN can outnumber
# the person who actually holds it
source_record_simulants.apply(mode).loc[most_collisions.record_id]

'0_3938'

In [61]:
# So, let's prioritize the use of SSA records
source_record_simulants_based_on_ssa_only = get_simulants_of_source_records(geobase_reference_file, filter_record_ids=lambda r_id: 'ssa_numident_' in r_id)

In [62]:
geobase_reference_file_ground_truth = (
    source_record_simulants_based_on_ssa_only.apply(mode)
        # When there are no SSA records (ITIN-based), use the standard mode
        .fillna(source_record_simulants.apply(mode))
        .rename('simulant_id')
)
geobase_reference_file_ground_truth

record_id
geobase_reference_file_0        0_13602
geobase_reference_file_1        0_16514
geobase_reference_file_2        0_16514
geobase_reference_file_3        0_16514
geobase_reference_file_4        0_13906
                                 ...   
geobase_reference_file_34207     0_3634
geobase_reference_file_34208    0_13204
geobase_reference_file_34209    0_19917
geobase_reference_file_34210     0_3731
geobase_reference_file_34211      0_670
Name: simulant_id, Length: 34212, dtype: object

In [63]:
geobase_reference_file_ground_truth.loc[most_collisions.record_id]

'0_3938'
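
The SSA-first rule can be sketched in plain Python on a toy record. The `truth` mapping, the simulant labels, and the function names below are invented for illustration; the filter mirrors the `'ssa_numident_' in r_id` test used above.

```python
from statistics import multimode

def first_mode(values):
    return multimode(values)[0] if values else None

# Hypothetical ground truth: source record ID -> simulant ID
truth = {
    "ssa_numident_1": "holder",
    "w2_1": "holder",
    "w2_2": "borrower",
    "w2_3": "borrower",
    "w2_4": "borrower",
}

def ground_truth_simulant(source_record_ids):
    # Prefer SSA-derived source records; fall back to all records (ITIN-based rows)
    ssa_ids = [r for r in source_record_ids if "ssa_numident_" in r]
    preferred = ssa_ids if ssa_ids else list(source_record_ids)
    return first_mode([truth[r] for r in preferred])

record = ("ssa_numident_1", "w2_1", "w2_2", "w2_3", "w2_4")
plain_mode = first_mode([truth[r] for r in record])
ssa_first = ground_truth_simulant(record)
itin_only = ground_truth_simulant(("w2_2", "w2_3"))
print(plain_mode, ssa_first, itin_only)
```

In this toy record the borrower's filings outnumber the holder's, so the plain mode picks the borrower while the SSA-first rule picks the holder.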

### Get ground truth by SSN

In [64]:
all_ssn_simulant_pairs = pd.concat([
    census_numident.set_index("record_id")[["ssn"]].join(census_numident_ground_truth),
    alternate_name_numident.set_index("record_id")[["ssn"]].join(alternate_name_numident_ground_truth),
    alternate_dob_numident.set_index("record_id")[["ssn"]].join(alternate_dob_numident_ground_truth),
    name_dob_reference_file.set_index("record_id")[["ssn"]].join(name_dob_reference_file_ground_truth),
    geobase_reference_file.set_index("record_id")[["ssn"]].join(geobase_reference_file_ground_truth),
])
all_ssn_simulant_pairs

record_id,ssn,simulant_id
census_numident_0,757-60-0267,0_3679
census_numident_1,182-19-6926,0_8793
census_numident_2,366-59-0431,0_15095
census_numident_3,749-19-8025,0_978
census_numident_4,636-88-4449,0_2684
...,...,...
geobase_reference_file_34207,996-41-6227,0_3634
geobase_reference_file_34208,997-63-6760,0_13204
geobase_reference_file_34209,997-78-4873,0_19917
geobase_reference_file_34210,997-91-2268,0_3731


In [65]:
# The reference file records with a given SSN all have the same (primary) simulant ID
# contributing to them
assert (all_ssn_simulant_pairs.groupby('ssn').simulant_id.nunique() == 1).all()

In [66]:
ssn_to_simulant = all_ssn_simulant_pairs.groupby('ssn').simulant_id.first()
ssn_to_simulant

ssn
001-02-4588    0_13602
001-15-8330    0_16514
001-16-0077    0_13906
001-17-9511    0_13442
001-25-8258    0_22495
                ...   
997-78-4873    0_19917
997-91-2268     0_3731
998-22-9577     0_9017
999-41-1826      0_670
999-80-2455    0_12964
Name: simulant_id, Length: 18907, dtype: object
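
If the assertion above ever failed, the same `groupby` could be used to list the offending SSNs. A minimal sketch on made-up pairs (the `pairs` and `conflicts` names are hypothetical):

```python
import pandas as pd

# Toy SSN/simulant pairs; "222" is deliberately inconsistent
pairs = pd.DataFrame({
    "ssn": ["111", "111", "222", "222"],
    "simulant_id": ["0_1", "0_1", "0_2", "0_9"],
})

# Count distinct simulants per SSN and keep only the SSNs with more than one
counts = pairs.groupby("ssn").simulant_id.nunique()
conflicts = counts[counts > 1]
print(conflicts)
```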

## Save results

In [67]:
files = {
    'census_2030': (census_2030, census_2030_ground_truth),
    'census_numident': (census_numident, census_numident_ground_truth),
    'alternate_name_numident': (alternate_name_numident, alternate_name_numident_ground_truth),
    'alternate_dob_numident': (alternate_dob_numident, alternate_dob_numident_ground_truth),
    'geobase_reference_file': (geobase_reference_file, geobase_reference_file_ground_truth),
    'name_dob_reference_file': (name_dob_reference_file, name_dob_reference_file_ground_truth),
}

In [68]:
reference_files = [census_numident, alternate_name_numident, alternate_dob_numident, geobase_reference_file, name_dob_reference_file]
# TODO: Rename the ssn column to explicitly include ITINs, since the current name is confusing
all_ssns_itins_in_reference_files = pd.concat([df[["ssn"]] for df in reference_files], ignore_index=True)
ssn_to_pik = (
    all_ssns_itins_in_reference_files.drop_duplicates()
        .reset_index().rename(columns={'index': 'pik'})
        .set_index('ssn').pik
)
ssn_to_pik

ssn
757-60-0267        0
182-19-6926        1
366-59-0431        2
749-19-8025        3
636-88-4449        4
               ...  
996-41-6227    91098
997-63-6760    91099
997-78-4873    91100
997-91-2268    91101
999-41-1826    91102
Name: pik, Length: 18907, dtype: int64
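
Note that the PIK values above are not contiguous (they run up to 91102 for 18907 identifiers). That is a consequence of `drop_duplicates` keeping the original row positions, which `reset_index` then turns into the `pik` column. The toy example below reproduces this, and also shows one possible alternative that yields dense codes using `pd.factorize`. The alternative is an assumption for illustration, not what this notebook does; either way the PIK is just an arbitrary unique label.

```python
import pandas as pd

ids = pd.DataFrame({"ssn": ["111", "222", "111", "333", "222"]})

# Approach used above: the PIK is the row position of each identifier's first occurrence
pik_by_position = (
    ids.drop_duplicates()
        .reset_index().rename(columns={"index": "pik"})
        .set_index("ssn").pik
)

# Alternative sketch: dense integer codes in order of first appearance
codes, uniques = pd.factorize(ids.ssn)
pik_dense = pd.Series(range(len(uniques)), index=uniques, name="pik")

print(pik_by_position.to_dict())
print(pik_dense.to_dict())
```

In the toy data, `333` gets PIK 3 under the positional approach (its row position) and PIK 2 under the dense one.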

In [69]:
pik_to_simulant = (
    ssn_to_simulant.reset_index()
        .assign(pik=lambda df: df.ssn.map(ssn_to_pik))
        .set_index("pik")
        .simulant_id
)
pik_to_simulant

pik
452      0_13602
5866     0_16514
4761     0_13906
4172     0_13442
17384    0_22495
          ...   
91100    0_19917
91101     0_3731
11130     0_9017
91102      0_670
4027     0_12964
Name: simulant_id, Length: 18907, dtype: object

In [70]:
for file_name, (file, ground_truth) in files.items():
    # Check that the record IDs are unique and match the ground truth exactly.
    # (Record IDs could be added here instead of within the pipeline, but then it's harder to match up the ground truth.)
    assert file.record_id.is_unique and ground_truth.index.is_unique
    assert set(file.record_id) == set(ground_truth.index)

    # This tuple column is a pain to serialize
    file = file.drop(columns=['source_record_ids'])

    if file_name != 'census_2030':
        file['pik'] = file.ssn.map(ssn_to_pik)
        assert file.pik.notnull().all()

    ground_truth = ground_truth.reset_index()

    file.to_parquet(f'output/{file_name}_sample.parquet')
    ground_truth.to_parquet(f'output/{file_name}_ground_truth_sample.parquet')

In [71]:
pik_to_simulant.reset_index().to_parquet('output/pik_to_simulant_ground_truth.parquet')

In [72]:
# Convert this notebook to a Python script
! cd .. && ./convert_notebook.sh generate_simulated_data/generate_simulated_data_small_sample

[NbConvertApp] Converting notebook generate_simulated_data/generate_simulated_data_small_sample.ipynb to python
