# **Person linkage case study with small-scale sample data**

In this case study, we imagine running a person linkage to add unique identifiers to a simulated 2030 Census Unedited File (CUF).
We emulate the methods of the Census Bureau's Person Identification Validation System (PVS) using publicly available descriptions.
The PVS aims to give people in the input file (the CUF, in this case) unique identifiers that can be used to link them to other administrative records.

We approximate the (highly confidential) input data to PVS by using simulated data from our pseudopeople package.
See the `generate_simulated_data` notebook for details about this.

## Sources

These papers describe PVS as their main subject:

* Wagner and Layne. The Person Identification Validation System (PVS): Applying the Center for Administrative Records Research and Applications’ (CARRA) Record Linkage Software. 2014. https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-01.pdf ([archived](https://web.archive.org/web/20230216043235/https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-01.pdf))
* Layne, Wagner, and Rothhaas. Estimating Record Linkage False Match Rate for the Person Identification Validation System. 2014. https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-02.pdf ([archived](https://web.archive.org/web/20220121051156/https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-02.pdf))
* Alexander et al. Creating a Longitudinal Data Infrastructure at the Census Bureau. 2015. https://www.census.gov/content/dam/Census/library/working-papers/2015/adrm/2015-alexander.pdf ([archived](https://web.archive.org/web/20170723192315/http://www.census.gov/content/dam/Census/library/working-papers/2015/adrm/2015-alexander.pdf))
* Massey and O'Hara. Person Matching in Historical Files using the Census Bureau’s Person Validation System. 2014. https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-11.pdf ([archived](https://web.archive.org/web/20221018074814/http://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-11.pdf))
* NORC. Assessment of the U.S. Census Bureau’s Person Identification Validation System. 2011. https://www.norc.org/content/dam/norc-org/pdfs/PVS%20Assessment%20Report%20FINAL%20JULY%202011.pdf ([archived](https://web.archive.org/web/20230705005935/https://www.norc.org/content/dam/norc-org/pdfs/PVS%20Assessment%20Report%20FINAL%20JULY%202011.pdf))

These apply PVS to some linking task, and in doing so describe it in some detail:

* Brown et al. Real-Time 2020 Administrative Record Census Simulation. 2023. https://www2.census.gov/programs-surveys/decennial/2020/program-management/evaluate-docs/EAE-2020-admin-records-experiment.pdf ([archived](https://web.archive.org/web/20230521191811/https://www2.census.gov/programs-surveys/decennial/2020/program-management/evaluate-docs/EAE-2020-admin-records-experiment.pdf))
* Massey et al. Linking the 1940 U.S. Census with Modern Data. 2018. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6530596/ ([DOI](https://doi.org/10.1080%2F01615440.2018.1507772))
* Luque and Wagner. Assessing Coverage and Quality of the 2007 Prototype Census Kidlink Database. 2015. https://www.census.gov/content/dam/Census/library/working-papers/2015/adrm/carra-wp-2015-07.pdf ([archived](https://web.archive.org/web/20220808205231/http://www.census.gov/content/dam/Census/library/working-papers/2015/adrm/carra-wp-2015-07.pdf))
* Bond et al. The Nature of the Bias When Studying Only Linkable Person Records: Evidence from the American Community Survey. 2014. https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-08.pdf ([archived](https://web.archive.org/web/20220803083857/http://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-08.pdf))

Finally, this paper is about the address matching process (MAF Match) which acts as part of PVS:

* Brummet. Comparison of Survey, Federal, and Commercial Address Data Quality. 2014. https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-06.pdf ([archived](https://web.archive.org/web/20220121213358/https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-06.pdf))

All of these papers are referenced throughout this notebook (and other notebooks) by the authors' names, along with additional information like page numbers.
All page numbers represent those of the PDF files themselves, *not* page numbers printed in the documents.

## PVS overview

Brown et al., p. 28:

> The PVS uses probabilistic record linkage (Fellegi and Sunter, 1969) to match data from an
    incoming file (e.g., a survey or administrative record file) to reference files containing data on
    SSN applications from the NUMIDENT enhanced with address data obtained from other federal
    administrative records.

## Setup

In [1]:
import re, copy, os
import pandas as pd, numpy as np
import jellyfish

In [2]:
data_to_use = 'small_sample'
input_dir = 'generate_simulated_data/output'
output_dir = 'output'
splink_engine = 'duckdb'
spark_master_url = "local[2]"

# Load data

See the code in the `generate_simulated_data` directory for how we generated the files to link.

In [3]:
census_2030 = pd.read_parquet(f'{input_dir}/{data_to_use}/simulated_census_2030.parquet')
geobase_reference_file = pd.read_parquet(f'{input_dir}/{data_to_use}/simulated_geobase_reference_file.parquet')
name_dob_reference_file = pd.read_parquet(f'{input_dir}/{data_to_use}/simulated_name_dob_reference_file.parquet')

# Pre-process input file

Wagner and Layne, p. 9:

> The first step of the PVS process is to edit data fields to make them homogenous for
comparisons between incoming and reference files.

In [4]:
# Input file before any processing; the final result will be this with the PIK column added
census_2030_raw_input = census_2030.copy()

## Name parsing and standardizing

Wagner and Layne, p. 9:

> The first edits are parsing and standardizing - parsing separates fields into
component parts, while standardizing guarantees key data elements are consistent (e.g.,
STREET, STR are both converted to ST). Name and address fields are parsed and
standardized as they are key linkage comparators. 

As noted in the data generation notebook, there is no parsing to be done here:
because the Census questionnaire asks for first name, middle initial, and last name as
separate fields on the form, the name would already be parsed when running PVS on the CUF.

The real PVS name parser generates fields like prefix (e.g. Mr.) and suffix (e.g. Jr.)
but we never have these in names from pseudopeople.

More details about the name parser and standardizer (Wagner and Layne, p. 10):

> The PVS system incorporates a name
standardizer (McGaughey, 1994), which is a C language subroutine called as a function
within SAS. It performs name parsing and includes a nickname lookup table and outputs
name variants (standardized variations of first and last names). For example, Bill
becomes William, Chuck and Charlie becomes Charles, etc. The PVS keeps both the
original name (Bill) and the converted name (William) for matching. PVS also has a fake
name table to blank names such as “Queen of the House” or “Baby Girl.” The name data
are parsed, checked for nicknames, and standardized.

The key thing that wasn't clear to me from this description was what it meant to
"keep both original name and the converted name for matching."
I found this in Massey et al.:

> The Name Search Module also accounts for instances where census records contain a nickname.
> For these records, the preprocessing step of the Name Search module outputs two records for
> these observations, one record for the nickname and one record for the formal name.
> For example, if the input record has the name “Bill Smith,” the formatting program will add
> a formal name “William” to that record. This record will then output to both the B-S cut and to the W-S cut.

From this, I gather that nicknames work similarly to alternate names in the reference file:
entire duplicate records are made with each name (nickname and formal).
However, I am still confused by calling this "the preprocessing step **of the Name Search module**."
In Wagner and Layne it does not seem like this is module-specific, and I don't see any
reason that would be desirable, so I have done this multi-record approach across all modules.

In [5]:
# Nickname processing
# Have not yet found a nickname list in PVS docs,
# so we do a minimal version for now -- could use
# another list such as the one in pseudopeople
# These examples all come directly from examples in the descriptions of PVS
nickname_standardizations = {
    "Bill": "William",
    "Chuck": "Charles",
    "Charlie": "Charles",
    "Cathy": "Catherine",
    "Matt": "Matthew",
}
has_nickname = census_2030.first_name.isin(nickname_standardizations.keys())
print(f'{has_nickname.sum()} nicknames in the Census')

# Add extra rows for the normalized names
census_2030 = pd.concat([
    census_2030,
    census_2030[has_nickname].assign(first_name=lambda df: df.first_name.replace(nickname_standardizations))
], ignore_index=True)

# Note: The above will introduce duplicates on record_id, so we redefine
# record_id to be unique (without getting rid of the original, input file record ID)
def add_unique_record_id(df, dataset_name):
    df = df.reset_index().rename(columns={'index': 'record_id'})
    df['record_id'] = f'{dataset_name}_' + df.record_id.astype(str)
    return df

census_2030 = add_unique_record_id(
    census_2030.rename(columns={'record_id': 'record_id_raw_input_file'}),
    "census_2030_preprocessed",
)

4 nicknames in the Census


In [6]:
# This list of fake names comes from the NORC report, p. 100-101
# It is what was used in PVS as of 2011
with open('fake-names.txt') as f:
    fake_names = pd.Series([name.strip().upper() for name in f.readlines()])

assert (fake_names == fake_names.str.upper()).all()
fake_names

0        (CONFIDENTIAL)
1      (NO MIDDLE NAME)
2           A RELUCTANT
3                 ADULT
4               ADULT F
             ...       
766       YOUNGEST DAUG
767       YOUNGEST GIRL
768              YR BOY
769             YR GIRL
770              YR OLD
Length: 771, dtype: object

In [7]:
for col in ["first_name", "last_name"]:
    has_fake_name = census_2030[col].str.upper().isin(fake_names)
    print(f'Found {has_fake_name.sum()} fake names in {col}')
    census_2030[col] = np.where(
        has_fake_name,
        np.nan,
        census_2030[col]
    )

Found 60 fake names in first_name
Found 57 fake names in last_name


## Address parsing and standardizing

Wagner and Layne, p. 11:

> The PVS editing process also incorporates an address parser and standardizer,
written in the C language and called as a function within SAS (U.S. Census Bureau
Geography Division, 1995). It performs parsing of address strings into individual output
fields (see Figure 2), and standardizes the spelling of key components of the address
such as street type. The PVS also incorporates use of a commercial product to update
zip codes, and correct misspellings of address elements.

As noted in the data generation notebook, for now we haven't combined the address parts
into a single string that would need to be parsed.
We plan to add this in a future version of pseudopeople.

For now, we do some simple standardization in each of the address parts (street number,
street name, etc).

In [8]:
def standardize_address_part(column):
    return (
        column
            # Remove leading or trailing whitespace
            .str.strip()
            # Turn any strings of consecutive whitespace into a single space
            .str.split().str.join(' ')
            # Normalize case
            .str.upper()
            # Normalize the word street as described in the example quoted from
            # Wagner and Layne p. 9
            # In reality, there would be many rules like this
            # Raw string, so that \b is a regex word boundary rather than a backspace character
            .str.replace(r'\b(STREET|STR)\b', 'ST', regex=True)
            # Make sure missingness is represented consistently
            .replace('', np.nan)
    )

address_cols = ['street_number', 'street_name', 'unit_number', 'city', 'state', 'zipcode']
census_2030[address_cols] = census_2030[address_cols].apply(standardize_address_part)
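
To illustrate the street-type rule on a single string, here is a minimal standalone sketch using Python's `re` module instead of pandas. The name `standardize_street_type` is hypothetical. The pattern is written as a raw string so that `\b` is read as a regex word boundary; the word boundaries keep a name such as STRAWBERRY from being altered.

```python
import re

# Hypothetical helper mirroring the STREET/STR -> ST rule for one string
STREET_TYPE_PATTERN = re.compile(r'\b(STREET|STR)\b')

def standardize_street_type(address_part):
    # Collapse whitespace and normalize case, then rewrite whole-word matches only
    normalized = ' '.join(address_part.split()).upper()
    return STREET_TYPE_PATTERN.sub('ST', normalized)

print(standardize_street_type('  123  main street '))  # 123 MAIN ST
print(standardize_street_type('Strawberry Str'))       # STRAWBERRY ST
```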

## MAFMatch

**General information about MAFMatch**

Wagner and Layne, p. 11:

> The PVS provides an additional address enhancement by matching records in the
incoming file to Census Bureau’s Master Address File (MAF) in order to assign a unique
address identifier, the MAF Identifier (MAFID), and other Census geographical codes
(e.g., Census tract and block). The MAFID is used in the PVS for search purposes and as a
linkage key for administrative files. Then, addresses are matched to the Census Bureau’s
Topologically Integrated Geographic Encoding and Referencing Database (TIGER) to
obtain Census geographical codes.

My understanding is that this step is useful for two reasons:
- Potentially adds an alternate version/way of formatting the same address
  (the way it is in the MAF, instead of the way it is in the input file).
  However, instead of using this alternate to match directly, it is boiled down
  to a MAFID, so it is only useful when an input file address and a reference file
  address (different from one another) both match to the same MAF address.
  (Presumably, the whole MAFMatch process described here needs to run on the reference files?)
- Adds Census geographies which could be used in blocking (but are they?)

I don't understand the TIGER part of this.
In particular, p. 12 describes a probabilistic/fuzzy match to TIGER, but I thought TIGER
would contain exactly the same addresses as the MAF.
I also don't understand what geocodes would be present in TIGER that wouldn't also be
present in the MAF.
Maybe these things are more out-of-sync with each other than I understood.

**MAFMatch in this case study**

In the decennial Census, the sampling frame is a subset of the MAF (Brown et al., p. 15, footnote 4).
That is, in the CUF, the address field (which is not supplied by the respondent)
would be the MAF address the questionnaire/NRFU was sent to.

Therefore, I don't believe it makes sense to do MAFMatch for the CUF, because all the addresses
are already the same.
I haven't fully confirmed this, but p. 14 of Wagner and Layne says

> The 2010 Census Unedited File (CUF), had 350 million records and processed through every PVS
module, excluding MAF match and SSN verification

which suggests that (like SSN verification) MAFMatch is not applicable to the CUF.

Given this, we skip MAFMatch here.
Also, if we wanted to add it, we would need to generate something like a MAF from pseudopeople.

## Drop records with insufficient information

In 2011 (NORC, p. 25):

> The initial edit process, described in the Introduction: PVS Background section, removes from consideration
incoming records that have no name data. Therefore, no record that is processed in PVS has blank first and last
names.

In 2014 it appears to be the same/similar, because in Table 2 of Wagner and Layne there is a row "NO SEARCH: Blank Name" (p. 18).

In 2023 (Brown et al., p. 28):

> Records containing sufficient PII to be linkable with some confidence, for example those
containing name and age, are sent through the linkage process.<sup>15</sup>

> <sup>15</sup> Records with names on the PVS invalid name list (e.g., “Mickey Mouse,” “householder,” or “son”) are excluded
from PVS search.

I'd prefer to use the 2023 information, but it is too vague:
"for example" means this is just an approximation, and it doesn't specify
parts of "name."
The footnote also seems to contradict earlier reports, which said fake names were simply
blanked out, which seems preferable.
Since the fake name step comes before this one, a fake first **and** last name would lead
to exclusion here, which is perhaps what the footnote was intending?

Here, we follow the blank-name approach.

In [9]:
census_2030 = census_2030[
    census_2030.first_name.notnull() |
    census_2030.last_name.notnull()
]

# Create derived variables for use in linkage

Here we create variables to be used as matching variables and blocking keys,
when those matching variables/blocking keys are defined in a way that is not already
present in the data at this point.

The variables needed here depend on the modules and passes described below -- see those
sections for more citations.

In [10]:
# We want to compare mailing address with physical address
geobase_reference_file = geobase_reference_file.rename(columns=lambda c: c.replace('mailing_address_', ''))

In [11]:
# PVS uses DOB as separate fields for day, month, and year
def split_dob(df, date_format='%Y%m%d'):
    df = df.copy()
    # Have to be floats because we want to treat as numeric for assessing similarity
    # Note that as of now, none of our pseudopeople noise types would change the punctuation ("/") in the date, but
    # they can insert non-numeric characters here or otherwise create invalid dates, in which case we fail to parse the date
    # and treat it as missing.
    dob = pd.to_datetime(df.date_of_birth, format=date_format, errors='coerce')
    df['month_of_birth'] = dob.dt.month
    df['year_of_birth'] = dob.dt.year
    df['day_of_birth'] = dob.dt.day
    return df.drop(columns=['date_of_birth'])

census_2030 = split_dob(census_2030, date_format='%m/%d/%Y')
geobase_reference_file = split_dob(geobase_reference_file)
name_dob_reference_file = split_dob(name_dob_reference_file)
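
The parse-or-treat-as-missing behavior described in the comments above can be sketched for a single value with the standard library. `split_dob_value` is a hypothetical name; unlike the pandas version, which produces float columns containing NaN, this sketch returns `None` for each part when parsing fails.

```python
from datetime import datetime

def split_dob_value(date_string, date_format='%m/%d/%Y'):
    # Returns (month, day, year), or (None, None, None) if the value
    # is missing or cannot be parsed as a valid date in the given format
    try:
        dob = datetime.strptime(date_string, date_format)
    except (TypeError, ValueError):
        return (None, None, None)
    return (dob.month, dob.day, dob.year)

valid = split_dob_value('07/04/1985')
impossible_date = split_dob_value('02/30/1990')
non_numeric_character = split_dob_value('07/O4/1985')
reference_style = split_dob_value('19851231', date_format='%Y%m%d')
```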

In [12]:
# I don't fully understand the purpose of blocking on the geokey,
# as opposed to just blocking on its constituent columns.
# Maybe it is a way of dealing with missingness in those constituent
# columns (e.g. so an address with no unit number can still be blocked on geokey)?
def add_geokey(df):
    df = df.copy()
    df['geokey'] = (
        df.street_number + ' ' +
        df.street_name + ' ' +
        df.unit_number.fillna('') + ' ' +
        df.city + ' ' +
        df.state.astype(str) + ' ' +
        df.zipcode
    )
    # Normalize the whitespace -- necessary if the unit number was null
    df['geokey'] = (
        df.geokey.str.split().str.join(' ')
    )
    return df

geobase_reference_file = add_geokey(geobase_reference_file)
census_2030 = add_geokey(census_2030)
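
One way to see what the geokey might buy us is to build it for a single address with and without a unit number. This is a simplified sketch: `make_geokey` is a hypothetical name, the address values are made up, and the sketch treats every part other than the unit number as required.

```python
def make_geokey(street_number, street_name, unit_number, city, state, zipcode):
    # Simplifying assumption: only the unit number may be missing
    required = [street_number, street_name, city, state, zipcode]
    if any(part is None for part in required):
        return None
    parts = [street_number, street_name, unit_number or '', city, state, zipcode]
    # Joining and re-splitting normalizes the extra space left by a missing unit
    return ' '.join(' '.join(parts).split())

no_unit = make_geokey('1130', 'MALLORY LN', None, 'ANYTOWN', 'WA', '00000')
with_unit = make_geokey('1130', 'MALLORY LN', 'APT 2', 'ANYTOWN', 'WA', '00000')
```

Two records that both lack a unit number still agree on the geokey, whereas blocking on the unit number column directly would make them skip the pass (records missing a blocking field skip that pass, per Wagner and Layne, p. 12). This is consistent with the guess in the comment above, though the PVS documentation quoted here does not confirm it.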

In [13]:
# Layne, Wagner, and Rothhaas p. 26: the name matching variables are
# First 15 characters First Name, First 15 characters Middle Name, First 12 characters Last Name
# Additionally, there are blocking columns for all of 1-3 initial characters of First/Last.
# We don't have a full middle name in pseudopeople (nor would that be present in a real CUF)
# so we have to stick to the first initial for middle.
def add_truncated_name_cols(df):
    df = df.copy()
    df['first_name_15'] = df.first_name.str[:15]
    df['last_name_12'] = df.last_name.str[:12]

    if 'middle_name' in df.columns and 'middle_initial' not in df.columns:
        df['middle_initial'] = df.middle_name.str[:1]

    for num_chars in [1, 2, 3]:
        df[f'first_name_{num_chars}'] = df.first_name.str[:num_chars]
        df[f'last_name_{num_chars}'] = df.last_name.str[:num_chars]

    return df

census_2030 = add_truncated_name_cols(census_2030)
geobase_reference_file = add_truncated_name_cols(geobase_reference_file)
name_dob_reference_file = add_truncated_name_cols(name_dob_reference_file)

In [14]:
# Layne, Wagner, and Rothhaas p. 26: phonetics are used in blocking (not matching)
# - Soundex for Street Name
# - NYSIIS code for First Name
# - NYSIIS code for Last Name
# - Reverse Soundex for First Name
# - Reverse Soundex for Last Name

def add_name_phonetics(df):
    df = df.copy()

    for col in ['first_name', 'last_name']:
        df[f'{col}_nysiis'] = df[col].dropna().apply(jellyfish.nysiis)
        df[f'{col}_reverse_soundex'] = df[col].dropna().str[::-1].apply(jellyfish.soundex)

    return df

def add_address_phonetics(df):
    df = df.copy()
    df['street_name_soundex'] = df.street_name.dropna().apply(jellyfish.soundex)
    return df

census_2030 = add_name_phonetics(census_2030)
census_2030 = add_address_phonetics(census_2030)

geobase_reference_file = add_address_phonetics(geobase_reference_file)

name_dob_reference_file = add_name_phonetics(name_dob_reference_file)
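
To make "reverse Soundex" concrete, the sketch below implements the standard American Soundex rules in pure Python and applies them to the reversed name. It is an illustrative stand-in for `jellyfish.soundex`, not a replacement for it, and the helper names are hypothetical. The example names and codes are taken from the `census_2030` table shown later in this notebook.

```python
# Map each consonant to its Soundex digit; vowels, H, W, and Y have no digit
_SOUNDEX_CODES = {}
for _letters, _digit in [('BFPV', '1'), ('CGJKQSXZ', '2'), ('DT', '3'),
                         ('L', '4'), ('MN', '5'), ('R', '6')]:
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit

def soundex(name):
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ''
    result = letters[0]
    previous_code = _SOUNDEX_CODES.get(letters[0], '')
    for ch in letters[1:]:
        if ch in 'HW':
            continue  # H and W do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, '')
        if code and code != previous_code:
            result += code
            if len(result) == 4:
                break
        previous_code = code  # a vowel resets this, so a repeated code counts again
    return result.ljust(4, '0')

def reverse_soundex(name):
    # Soundex of the name spelled backwards, which emphasizes the end of the name
    return soundex(name[::-1])

for example in ['Gerald', 'Allen', 'Bobby']:
    print(example, reverse_soundex(example))
```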

In [15]:
# Columns used to "cut the database": ZIP3 and a grouping of first and last initial
def add_zip3(df):
    return df.assign(zip3=lambda x: x.zipcode.str[:3])

def add_first_last_initial_categories(df):
    # Page 20 of the NORC report: "Name-cuts are defined by combinations of the first characters of the first and last names. The twenty letter groupings
    # for the first character are: A-or-blank, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, and U-Z."
    initial_cut = lambda x: x.fillna('A').str[0].replace('A', 'A-or-blank').replace(['U', 'V', 'W', 'X', 'Y', 'Z'], 'U-Z')
    return df.assign(first_initial_cut=lambda x: initial_cut(x.first_name), last_initial_cut=lambda x: initial_cut(x.last_name))

In [16]:
census_2030 = add_zip3(census_2030)
census_2030 = add_first_last_initial_categories(census_2030)

geobase_reference_file = add_zip3(geobase_reference_file)

name_dob_reference_file = add_first_last_initial_categories(name_dob_reference_file)
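
For a single name, the cut assignment works as in the sketch below. `initial_cut_value` is a hypothetical scalar version of the `initial_cut` lambda above. The expected values match rows in the `census_2030` table shown in the next section, e.g. William Vick falls in the U-Z cut for both first and last name.

```python
def initial_cut_value(name):
    # Assumption of this sketch: None and the empty string both count as blank
    # (the notebook's pandas version only fills NaN)
    if not name:
        return 'A-or-blank'
    initial = name[0]
    if initial == 'A':
        return 'A-or-blank'
    if initial in 'UVWXYZ':
        return 'U-Z'
    return initial

cuts = {name: initial_cut_value(name) for name in ['Gerald', 'Allen', 'William', 'Vick']}
```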

# Data, ready to link

In [17]:
census_2030

Unnamed: 0,record_id,record_id_raw_input_file,household_id,first_name,middle_initial,last_name,age,street_number,street_name,unit_number,...,first_name_3,last_name_3,first_name_nysiis,first_name_reverse_soundex,last_name_nysiis,last_name_reverse_soundex,street_name_soundex,zip3,first_initial_cut,last_initial_cut
0,census_2030_preprocessed_0,simulated_census_2030_0,0_8033,Gerald,R,Allen,86,1130,MALLORY LN,,...,Ger,All,GARALD,D462,ALAN,N400,M464,000,G,A-or-blank
1,census_2030_preprocessed_1,simulated_census_2030_1,0_1066,April,S,Hayden,33,32597,DELACORTE DR,,...,Apr,Hay,APRAL,L610,HAYDAN,N300,D426,000,A-or-blank,H
2,census_2030_preprocessed_2,simulated_census_2030_2,0_1066,Loretta,T,Lowe,71,32597,DELACORTE DR,,...,Lor,Low,LARAT,A364,LAO,E400,D426,000,L,L
3,census_2030_preprocessed_3,simulated_census_2030_3,0_2514,Sandra,A,Sorrentino,75,4458,WIBDSOR PL,,...,San,Sor,SANDR,A635,SARANTAN,O535,W132,000,S,S
4,census_2030_preprocessed_4,simulated_census_2030_4,0_5627,Bobby,S,Baker,44,,WINDING TRAIL RD,,...,Bob,Bak,BABY,Y110,BACAR,R210,W535,000,B,B
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11028,census_2030_preprocessed_11028,simulated_census_2030_11028,0_10693,Zechariah,C,Deshpande,0,1534,BENTLEY DR,,...,Zec,Des,ZACAR,H622,DASPAND,E351,B534,000,U-Z,D
11029,census_2030_preprocessed_11029,simulated_census_2030_3803,0_492,Charles,P,Foreman,64,224,N SANFORD ST,,...,Cha,For,CARL,S462,FARANAN,N561,N251,000,C,F
11030,census_2030_preprocessed_11030,simulated_census_2030_6806,0_11238,William,S,Vick,59,829,BERKELEY AVE,,...,Wil,Vic,WALAN,M400,VAC,K100,B624,000,U-Z,U-Z
11031,census_2030_preprocessed_11031,simulated_census_2030_7705,0_6764,Charles,P,Qusmus,63,6801,REDWOOD TERRACE,,...,Cha,Qus,CARL,S462,QASN,S522,R333,000,C,Q


In [18]:
geobase_reference_file

Unnamed: 0,record_id,ssn,first_name,middle_name,last_name,street_number,street_name,unit_number,po_box,city,...,last_name_12,middle_initial,first_name_1,last_name_1,first_name_2,last_name_2,first_name_3,last_name_3,street_name_soundex,zip3
0,simulated_geobase_reference_file_0,685-77-0916,Betty,Audrey,Keel,47461,W ROXBURY DR,,,ANYTOWN,...,Keel,A,B,K,Be,Ke,Bet,Kee,W621,000
1,simulated_geobase_reference_file_1,765-44-4521,Ethep,Nancy,Collier,,,,,,...,Collier,N,E,C,Et,Co,Eth,Col,,
2,simulated_geobase_reference_file_2,765-44-4521,Ethep,Nancy,Collier,,,,,,...,Collier,N,E,C,Et,Co,Eth,Col,,
3,simulated_geobase_reference_file_3,726-57-2168,Josephine,Margaret,Babbie,3743,MESA VERDE ST,,,ANYTOWN,...,Babbie,M,J,B,Jo,Ba,Jos,Bab,M216,000
4,simulated_geobase_reference_file_4,726-57-2168,Josephine,Margaret,Babbie,3743,MESA VERDE ST,,,,...,Babbie,M,J,B,Jo,Ba,Jos,Bab,M216,000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32820,simulated_geobase_reference_file_32820,939-53-1702,Josiah,L,Mcmiller,2904,YACHT CLUB PT,,,ANYTOWN,...,Mcmiller,L,J,M,Jo,Mc,Jos,Mcm,Y232,000
32821,simulated_geobase_reference_file_32821,939-53-1702,Josiah,L,Mcmiller,W14066,E VALENCIA RD,,,ANYTOWN,...,Mcmiller,L,J,M,Jo,Mc,Jos,Mcm,E145,000
32822,simulated_geobase_reference_file_32822,939-53-1702,Josiah,L,Mcmiller,W14066,E VALENCIA RD,,,ANYTOWN,...,Mcmiller,L,J,M,Jo,Mc,Jos,Mcm,E145,000
32823,simulated_geobase_reference_file_32823,993-56-2286,Larisa,A,Beilfuss,1411,W 80TH ST,,,ANYTOWN,...,Beilfuss,A,L,B,La,Be,Lar,Bei,W323,000


In [19]:
name_dob_reference_file

Unnamed: 0,record_id,ssn,first_name,middle_name,last_name,pik,month_of_birth,year_of_birth,day_of_birth,first_name_15,...,first_name_2,last_name_2,first_name_3,last_name_3,first_name_nysiis,first_name_reverse_soundex,last_name_nysiis,last_name_reverse_soundex,first_initial_cut,last_initial_cut
0,simulated_name_dob_reference_file_0,685-77-0916,Betty,Audrey,Keel,105906,12.0,1922.0,5.0,Betty,...,Be,Ke,Bet,Kee,BATY,Y310,CAL,L200,B,K
1,simulated_name_dob_reference_file_1,765-44-4521,Ethep,Nancy,Collier,104653,,,,Ethep,...,Et,Co,Eth,Col,ETAP,P300,CALAR,R420,E,C
2,simulated_name_dob_reference_file_2,765-44-4521,Ethep,Nancy,Collier,104653,2.0,1923.0,13.0,Ethep,...,Et,Co,Eth,Col,ETAP,P300,CALAR,R420,E,C
3,simulated_name_dob_reference_file_3,726-57-2168,Josephine,Margaret,Babbie,106223,7.0,1923.0,28.0,Josephine,...,Jo,Ba,Jos,Bab,JASAFAN,E512,BABY,E110,J,B
4,simulated_name_dob_reference_file_4,365-44-3027,Betty,Mary,,107896,8.0,1923.0,9.0,Betty,...,Be,,Bet,,BATY,Y310,,,B,A-or-blank
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19870,simulated_name_dob_reference_file_19870,977-84-6791,Liam,L,Wheatley,108816,,,,Liam,...,Li,Wh,Lia,Whe,LAN,M400,WATLY,Y430,L,U-Z
19871,simulated_name_dob_reference_file_19871,997-95-0405,Joseph,D,Dellinger,108817,,,,Joseph,...,Jo,De,Jos,Del,JASAF,H122,DALANGAR,R254,J,D
19872,simulated_name_dob_reference_file_19872,939-53-1702,Josiah,L,Mcmiller,108818,,,,Josiah,...,Jo,Mc,Jos,Mcm,JAS,H220,MCNALAR,R452,J,M
19873,simulated_name_dob_reference_file_19873,993-56-2286,Larisa,A,Beilfuss,108819,,,,Larisa,...,La,Be,Lar,Bei,LARAS,A264,BALF,S141,L,B


# Emulate Multi-Match with splink

Wagner and Layne, p. 8:

> The PVS employs its probabilistic record linkage software, Multi-Match (Wagner
2012), as an integral part of the PVS.

Wagner and Layne, p. 12:

> PVS uses the same Multi-Match engine for each probabilistic search type. For each
search module the analyst defines a parameter file, which is passed to Multi-Match. The
parameter file includes threshold value(s) for the number of passes, blocking keys, and
within each pass, the match variables, match comparison type, and matching weights...
>
> Records must first match exactly on the blocking keys before any comparisons
between the match variables are attempted. Each match variable is given an m and u
probability, which is translated by MultiMatch as agreement and disagreement weights.
The sum of all match variable comparison weights for a record pair is the composite
weight. All record pairs with a composite weight greater than or equal to the threshold
set in the parameter file are linked, and the records from the incoming file for these
linked cases are excluded from all remaining passes. All Numident records are always
available for linking in every pass. Any record missing data for any of the blocking fields
for a pass skips that pass and moves to the next pass.

[splink](https://github.com/moj-analytical-services/splink) is similar to Multi-Match
(both are based on the Fellegi-Sunter approach to record linkage).
It also implements:
- Exact blocking on specified keys
- Determining overall match probability based on conditional independence of individual field comparisons,
  each of which has m and u probabilities (multiplying the probability ratios is equivalent to
  summing the logarithmic "weights," as Multi-Match describes them)
- Setting a match threshold
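
As a worked illustration of the equivalence noted in the list above, the sketch below converts m and u probabilities into agreement and disagreement weights and checks that summing the weights gives the same result as multiplying the probability ratios. The m and u values are made up, and base-2 logarithms are an assumption: they are the convention splink uses for match weights, while the base Multi-Match uses is not stated in the passages quoted here.

```python
import math

def agreement_weight(m, u):
    return math.log2(m / u)

def disagreement_weight(m, u):
    return math.log2((1 - m) / (1 - u))

# Hypothetical (m, u) probabilities per match variable
params = {
    'first_name': (0.9, 0.01),
    'last_name': (0.95, 0.005),
    'year_of_birth': (0.98, 0.02),
}
# Hypothetical comparison outcome for one record pair
agrees = {'first_name': True, 'last_name': True, 'year_of_birth': False}

# Multi-Match style: sum the weights
composite_weight = sum(
    agreement_weight(m, u) if agrees[col] else disagreement_weight(m, u)
    for col, (m, u) in params.items()
)
# Equivalent: multiply the probability ratios, then take the logarithm
ratio_product = math.prod(
    (m / u) if agrees[col] else (1 - m) / (1 - u)
    for col, (m, u) in params.items()
)
assert math.isclose(composite_weight, math.log2(ratio_product))
```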

However, splink does not include some additional logic built into Multi-Match,
specifically the ability to run multiple "passes," removing linked records from
subsequent passes.
We have to implement that ourselves, calling splink again in each pass.
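
The pass logic we need to add on top of splink can be sketched as follows. This is a toy version, not the implementation used in this case study: `run_passes`, `score_pair`, and the pass definitions are hypothetical, and the scoring function stands in for a call to splink. It follows the rules quoted from Wagner and Layne above: exact agreement on blocking keys, a threshold on the composite weight, removal of linked input records from later passes, and skipping a pass when a blocking field is missing. Keeping only the highest-weight candidate per input record is a simplification of this sketch.

```python
def run_passes(input_records, reference_records, passes, score_pair):
    """Return {input record_id: reference record_id} for linked records."""
    links = {}
    for pass_definition in passes:
        keys = pass_definition['blocking_keys']
        # All reference records are available in every pass
        blocks = {}
        for ref in reference_records:
            if any(ref.get(k) is None for k in keys):
                continue
            blocks.setdefault(tuple(ref[k] for k in keys), []).append(ref)
        for rec in input_records:
            if rec['record_id'] in links:
                continue  # linked in an earlier pass, so excluded from this one
            if any(rec.get(k) is None for k in keys):
                continue  # missing a blocking field: skip this pass
            candidates = blocks.get(tuple(rec[k] for k in keys), [])
            scored = [(score_pair(rec, ref), ref) for ref in candidates]
            scored = [(w, ref) for w, ref in scored if w >= pass_definition['threshold']]
            if scored:
                _, best_ref = max(scored, key=lambda pair: pair[0])
                links[rec['record_id']] = best_ref['record_id']
    return links

# Made-up records and a made-up scoring rule, for illustration only
reference = [
    {'record_id': 'r1', 'zip3': '981', 'last_name': 'ALLEN', 'first_name': 'GERALD'},
    {'record_id': 'r2', 'zip3': '981', 'last_name': 'LOWE', 'first_name': 'LORETTA'},
]
inputs = [
    {'record_id': 'i1', 'zip3': '981', 'last_name': 'ALLEN', 'first_name': 'GERALD'},
    {'record_id': 'i2', 'zip3': None, 'last_name': 'LOWE', 'first_name': 'LORETTA'},
]

def score_pair(a, b):
    return sum(5.0 if a[f] == b[f] else -5.0 for f in ('first_name', 'last_name'))

passes = [
    {'blocking_keys': ['zip3'], 'threshold': 8.0},
    {'blocking_keys': ['last_name'], 'threshold': 8.0},
]
links = run_passes(inputs, reference, passes, score_pair)
```

Here `i1` links in the first pass, while `i2` skips the first pass because its `zip3` is missing and links in the second pass, which blocks on last name.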

## Estimate parameters (lambda, m, u) once for all modules

In Multi-Match, there is no lambda (prior probability of a link, before any comparisons are observed).
This is because Multi-Match works entirely in weight space, and sets different weight thresholds
in different modules/passes instead of changing the prior probability.
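
The relationship between a prior and a weight threshold can be made concrete. Under the Fellegi-Sunter model, posterior odds equal prior odds times the product of the probability ratios, so a probability threshold combined with a prior lambda corresponds to a threshold on the composite weight. The sketch below assumes base-2 weights and uses hypothetical function names; the lambda and threshold values are examples only.

```python
import math

def log2_odds(p):
    return math.log2(p / (1 - p))

def weight_threshold(probability_threshold, lam):
    # posterior log-odds = prior log-odds + composite weight
    return log2_odds(probability_threshold) - log2_odds(lam)

def posterior_probability(composite_weight, lam):
    odds = (lam / (1 - lam)) * 2 ** composite_weight
    return odds / (1 + odds)

lam = 1e-5  # example prior, not an estimate
threshold = weight_threshold(0.95, lam)
# A pair scoring exactly at the weight threshold has posterior probability 0.95
assert math.isclose(posterior_probability(threshold, lam), 0.95)
```

Changing lambda shifts the equivalent weight threshold by a constant, which is why varying the threshold by module or pass can play the same role as varying the prior.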

In Multi-Match, parameters are not directly estimated from the data.
They are primarily set manually by analysts, with a different set of parameter files
maintained for each type of input file (e.g. survey, administrative).

All parameters except lambda are column-level parameters.
We estimate the column-level parameters using the GeoBase Reference File,
since it has all columns that are used for matching with any reference file.

We estimate lambda (manually, see below) separately by reference file,
since the number of records is different between the reference files but
the size of the underlying population (people we are trying to match) is constant.

### lambda (prior probability of a match)

Splink has a built-in method (`estimate_probability_two_random_records_match`)
for estimating this, but it did not seem to give me reasonable estimates.

We just make an informed guess here based on how much overlap we expect
in the files, how much unintentional duplication we expect in the files (e.g. someone
being enumerated twice), how much *intentional* duplication we have in the
files (e.g. a record for each nickname variant), and some simplifying assumptions.

Our assumptions:
- 5% of the enumerations in the CUF are unintentional duplicates
- 0.5% of the PIKs in the reference file are unintentional duplicates
  (that same person is also represented with a different PIK)
- 90% of the people in the CUF are represented in the reference files
- Being represented in both files is independent of being unintentionally
  or intentionally duplicated
- Being intentionally duplicated in one file is independent of being intentionally
  duplicated in the other (likely not true since having a name with lots of variants
  would cause intentional duplicates in both)

We do this first (before m and u probabilities) because having lambda estimated is
useful to the EM algorithm for estimating m and u.

In [20]:
def estimate_number_true_matches(input_file, reference_file):
    people_represented_in_input_file = (
        input_file.record_id_raw_input_file.nunique() * 0.95
    )
    people_represented_in_reference_file = (
        reference_file.pik.nunique() * 0.995
    )
    people_represented_in_both = people_represented_in_input_file * 0.9

    # Assuming independence conditions as noted above, the number of true
    # matches that should be found for *each* true person is the expected number
    # of records in one file times the expected number of records in the other
    input_file_records_per_person = people_represented_in_input_file / len(input_file)
    reference_file_records_per_person = people_represented_in_reference_file / len(reference_file)
    record_matches_per_person = input_file_records_per_person * reference_file_records_per_person

    return people_represented_in_both * record_matches_per_person

def probability_two_random_records_match(input_file, reference_file):
    cartesian_product = len(input_file) * len(reference_file)
    return (
        estimate_number_true_matches(input_file, reference_file) /
        cartesian_product
    )

probability_two_random_records_match(census_2030, geobase_reference_file)

1.4225305711259186e-05

### m and u probabilities

In [21]:
common_cols = [c for c in census_2030.columns if c in geobase_reference_file.columns or c in name_dob_reference_file.columns]
common_cols

['record_id',
 'first_name',
 'middle_initial',
 'last_name',
 'street_number',
 'street_name',
 'unit_number',
 'city',
 'state',
 'zipcode',
 'month_of_birth',
 'year_of_birth',
 'day_of_birth',
 'geokey',
 'first_name_15',
 'last_name_12',
 'first_name_1',
 'last_name_1',
 'first_name_2',
 'last_name_2',
 'first_name_3',
 'last_name_3',
 'first_name_nysiis',
 'first_name_reverse_soundex',
 'last_name_nysiis',
 'last_name_reverse_soundex',
 'street_name_soundex',
 'zip3',
 'first_initial_cut',
 'last_initial_cut']

In [22]:
def prep_table_for_splink(df, dataset_name, columns):
    return (
        df[[c for c in df.columns if c in columns]]
            .assign(dataset_name=dataset_name)
    )

tables_for_splink = [
    prep_table_for_splink(geobase_reference_file, "geobase_reference_file", common_cols),
    prep_table_for_splink(census_2030, "census_2030", common_cols)
]

In [23]:
[len(t) for t in tables_for_splink]

[32825, 11030]

#### Define comparison variables and levels

**Variables**

The most recent report, from 2023 (Brown et al., p. 29), says:

> [In] GeoSearch... The typical matching variables are name, DOB, sex, and
  various address fields...
>
> NameSearch... uses only name and DOB fields...
> 
> DOBSearch... compares name, sex, and DOB...
> 
> the Household Composition search module... attempts to find a match... based on name and DOB information

So, the total list: name, DOB, sex, and "various address fields."
These address fields are listed in the PVS report (p. 34) which is admittedly from 2011:

> variables used directly in matching the input file with the reference files [include]...
> street name, street name prefix and suffix, house number, rural route and box, and ZIP code

**Comparisons**

Massey et al. footnote 2: 
> The PVS string comparator was developed by Winkler (1995) and measures the distance
  between two strings on a scale from 0 to 900, where a distance score of 0 is given if
  there is no similarity between two text strings and a score of 900 is given for an exact match.
  The cutoff value for the string distance is set to 750 in the Name Search module.

Massey and O'Hara, p. 6 footnote 7:
> For the comparison of text strings, a prorated value
  between the chosen agreement score and chosen disagreement score is given depending on the Jaro-Winkler
  distance between the string in the input file and reference file.

Massey et al.:
> For numeric variables, such as year of birth, a maximum acceptable difference
> between the variable value in the input and reference record is set. This also
> allows for creation of an interval, or band, around year of birth to permit
> inexact matches. Within this band, prorated agreement and disagreement weights
> are assigned depending on the similarity of year of birth.

The "maximum acceptable difference" implies that the m probability beyond that difference is
zero; such pairs should never be linked.

Note that this continuous "prorated value," as opposed to discrete comparison levels,
is not possible in Splink and goes outside the traditional Fellegi-Sunter (F-S) framing of m and u probabilities!
We omit it for now.
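
For reference, here is a minimal sketch of what a prorated weight could look like.
It assumes linear interpolation between the disagreement and agreement weights, which is a guess on my part:
the sources say only that the value is "prorated."
The function name `prorated_weight`, the example weights, and the example scores are all hypothetical.

```python
def prorated_weight(similarity, agreement_weight, disagreement_weight, cutoff=750 / 900):
    """Interpolate linearly between the disagreement and agreement weights.

    `similarity` is a string comparator score rescaled to [0, 1]
    (PVS reports it on a 0-900 scale). Scores below `cutoff` receive the
    full disagreement weight. Linear interpolation is an assumption.
    """
    if not 0 <= similarity <= 1:
        raise ValueError("similarity must be between 0 and 1")
    if similarity < cutoff:
        return disagreement_weight
    # Fraction of the way from the cutoff (0.0) to an exact match (1.0)
    fraction = (similarity - cutoff) / (1 - cutoff)
    return disagreement_weight + fraction * (agreement_weight - disagreement_weight)


# Hypothetical weights, for illustration only
for score in (900, 825, 700):
    weight = prorated_weight(score / 900, agreement_weight=4.0, disagreement_weight=-2.0)
    print(f"PVS score {score}: weight {weight:.2f}")
```

In this sketch an exact match gets the full agreement weight, a score halfway between the cutoff and 900 gets a weight halfway between the two, and anything below the cutoff gets the full disagreement weight.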

In [24]:
if splink_engine == 'duckdb':
    from splink.duckdb.comparison_library import (
        exact_match,
        jaro_winkler_at_thresholds,
    )
    import splink.duckdb.comparison_level_library as cll
elif splink_engine == 'spark':
    from splink.spark.comparison_library import (
        exact_match,
        jaro_winkler_at_thresholds,
    )
    import splink.spark.comparison_level_library as cll

def numeric_column_comparison(col_name, human_name, maximum_inexact_match_difference):
    return {
        "output_column_name": col_name,
        "comparison_description": human_name,
        "comparison_levels": [
            cll.null_level(col_name),
            cll.exact_match_level(col_name),
            {
                "sql_condition": f"abs({col_name}_l - {col_name}_r) <= {maximum_inexact_match_difference}",
                "label_for_charts": "Inexact match",
            },
            cll.else_level(),
        ],
    }

numeric_columns = ["day_of_birth", "month_of_birth", "year_of_birth"]

settings = {
    "link_type": "link_only",
    "comparisons": [
        jaro_winkler_at_thresholds("first_name_15", 750 / 900),
        jaro_winkler_at_thresholds("last_name_12", 750 / 900),
        exact_match("middle_initial"),
        numeric_column_comparison("day_of_birth", "Day of birth", maximum_inexact_match_difference=5),
        numeric_column_comparison("month_of_birth", "Month of birth", maximum_inexact_match_difference=3),
        numeric_column_comparison("year_of_birth", "Year of birth", maximum_inexact_match_difference=5),
        # Using same cutoffs as for names, in the absence of a better description of
        # how these are compared
        jaro_winkler_at_thresholds("street_number", 750 / 900),
        jaro_winkler_at_thresholds("street_name", 750 / 900),
        jaro_winkler_at_thresholds("unit_number", 750 / 900),
        jaro_winkler_at_thresholds("zipcode", 750 / 900),
    ],
    "probability_two_random_records_match": probability_two_random_records_match(census_2030, geobase_reference_file),
    "unique_id_column_name": "record_id",
    # Keeps intermediate columns (such as per-comparison Bayes factors) in the predictions;
    # this is necessary for some of the fancier graphs below
    "retain_intermediate_calculation_columns": True,
}

if splink_engine == 'duckdb':
    from splink.duckdb.linker import DuckDBLinker

    linker = DuckDBLinker(
        tables_for_splink,
        settings,
        input_table_aliases=["reference_file", "census_2030"]
    )
elif splink_engine == 'spark':
    # https://moj-analytical-services.github.io/splink/demos/examples/spark/deduplicate_1k_synthetic.html
    from splink.spark.jar_location import similarity_jar_location

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession
    from pyspark.sql import types

    conf = SparkConf()
    conf.setMaster(os.getenv("LINKER_SPARK_MASTER_URL", spark_master_url))
    conf.set("spark.driver.memory", "12g")
    conf.set("spark.default.parallelism", "2")

    # Add custom similarity functions, which are bundled with Splink
    # documented here: https://github.com/moj-analytical-services/splink_scalaudfs
    path = similarity_jar_location()
    conf.set("spark.jars", path)

    sc = SparkContext.getOrCreate(conf=conf)

    spark = SparkSession(sc)
    spark.sparkContext.setCheckpointDir("./tmpCheckpoints")

    from splink.spark.linker import SparkLinker
    linker = SparkLinker(
        tables_for_splink,
        settings,
        input_table_aliases=["reference_file", "census_2030"],
        spark=spark,
    )

    import warnings
    # PySpark triggers a lot of Pandas warnings
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    warnings.filterwarnings("ignore", category=FutureWarning)

#### Estimate u probabilities

This method seems to work well:
- For almost all of the columns, an exact or even inexact match is relatively rare
- Exact matches are common on ZIP code (makes sense, given our sample data is all in one ZIP)
- Inexact matches are more common where you would expect (day and month of birth have highly constrained values)

I have tested that this estimation is reproducible when run multiple times (with the same seed).
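
The idea behind this estimate can be sketched in a few lines.
Because almost all randomly chosen pairs of records are non-matches, the share of random pairs
that fall into a comparison level approximates u for that level.
The toy columns and the helper `estimate_u_exact_match` are hypothetical; Splink's own implementation differs in detail.

```python
import random


def estimate_u_exact_match(left_values, right_values, n_pairs=100_000, seed=1234):
    """Estimate u for an exact-match level: the share of randomly sampled
    (left, right) pairs whose values agree exactly."""
    rng = random.Random(seed)
    agreements = 0
    for _ in range(n_pairs):
        if rng.choice(left_values) == rng.choice(right_values):
            agreements += 1
    return agreements / n_pairs


# Toy columns: month of birth is roughly uniform over 12 values,
# so u for an exact match should be near 1/12
data_rng = random.Random(0)
left_months = [data_rng.randint(1, 12) for _ in range(1_000)]
right_months = [data_rng.randint(1, 12) for _ in range(3_000)]

u_month = estimate_u_exact_match(left_months, right_months)
print(f"Estimated u for exact match on month of birth: {u_month:.3f}")
```

As with the Splink call, fixing the seed makes the estimate reproducible from run to run.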

In [25]:
%%time

linker.estimate_u_using_random_sampling(max_pairs=1e7, seed=1234)

----- Estimating u probabilities using random sampling -----


Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - first_name_15 (no m values are trained).
    - last_name_12 (no m values are trained).
    - middle_initial (no m values are trained).
    - day_of_birth (no m values are trained).
    - month_of_birth (no m values are trained).
    - year_of_birth (no m values are trained).
    - street_number (no m values are trained).
    - street_name (no m values are trained).
    - unit_number (no m values are trained).
    - zipcode (no m values are trained).


CPU times: user 6.8 s, sys: 66.4 ms, total: 6.87 s
Wall time: 3.6 s


In [26]:
# Ignore the green bars on the left, these are the m probabilities that haven't been estimated yet
linker.m_u_parameters_chart()

#### Estimate m probabilities

The EM algorithm implemented in Splink can be used to estimate all the parameters at once.
However, I've found this to be extremely unstable, and the lambda and u estimates I've made
above are much better than what I get when I allow EM to update them.

These EM session blocking rules are the result of **lots** of trial and error.
I consistently had problems with the EM algorithm deciding that first names
were completely wrong ("All other comparisons") almost as often as they were
right.
I can think of two reasons this might be happening:
- The EM algorithm is just looking for two coherent clusters, and the assumption is that these
  clusters correspond to matches and non-matches.
  But in real life, and in our simulated data, matches and non-matches are not homogeneous.
  In particular, people who are in the same family live together and tend to share a last name,
  and this could have been the cluster EM was finding instead of finding the exact same person.
- Conditional independence is grossly violated by some of our columns, especially the address parts
  and the DOB parts. This could be causing some pathological behavior.

Using the EM approach below, I have at least obtained reasonable-looking m and u probabilities.
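
To make the setup concrete, here is a minimal sketch of EM for the m probabilities with lambda and u held fixed.
It is not Splink's implementation: it assumes binary (agree/disagree) comparisons and simulated data,
and the function name and all parameter values are hypothetical.

```python
import random


def estimate_m_with_em(comparison_vectors, u, lam, n_iterations=25, m_start=0.9):
    """EM for m probabilities with u and lambda held fixed.

    comparison_vectors: list of tuples of 0/1 (1 = the column agrees).
    u: per-column probability of agreement among non-matches (fixed).
    lam: prior probability that a random pair is a match (fixed).
    """
    n_columns = len(u)
    m = [m_start] * n_columns
    for _ in range(n_iterations):
        # E-step: posterior match probability for every pair
        posteriors = []
        for gamma in comparison_vectors:
            p_match, p_non_match = lam, 1 - lam
            for j, agrees in enumerate(gamma):
                p_match *= m[j] if agrees else 1 - m[j]
                p_non_match *= u[j] if agrees else 1 - u[j]
            posteriors.append(p_match / (p_match + p_non_match))
        # M-step: m is the posterior-weighted share of pairs that agree
        total = sum(posteriors)
        m = [
            sum(p * gamma[j] for p, gamma in zip(posteriors, comparison_vectors)) / total
            for j in range(n_columns)
        ]
    return m


# Simulate pairs under conditional independence (hypothetical parameters)
rng = random.Random(1234)
true_m, u, lam = [0.95, 0.90, 0.85], [0.05, 0.10, 0.02], 0.1
pairs = []
for _ in range(10_000):
    probs = true_m if rng.random() < lam else u
    pairs.append(tuple(int(rng.random() < p) for p in probs))

estimated_m = estimate_m_with_em(pairs, u, lam)
print([round(x, 3) for x in estimated_m])
```

The simulated pairs satisfy conditional independence exactly and contain only two clusters, which are the two conditions I suspect our data violate.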

In [27]:
blocking_rule_for_training = "l.first_name_15 = r.first_name_15"
em_session_1 = linker.estimate_parameters_using_expectation_maximisation(
    blocking_rule_for_training,
    # Fix lambda; u is fixed by default
    fix_probability_two_random_records_match=True,
)


----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l.first_name_15 = r.first_name_15

Parameter estimates will be made for the following comparison(s):
    - last_name_12
    - middle_initial
    - day_of_birth
    - month_of_birth
    - year_of_birth
    - street_number
    - street_name
    - unit_number
    - zipcode

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - first_name_15

Iteration 1: Largest change in params was -0.181 in the m_probability of street_name, level `Exact match`
Iteration 2: Largest change in params was 0.0142 in the m_probability of street_name, level `All other comparisons`
Iteration 3: Largest change in params was 0.000879 in the m_probability of last_name_12, level `All other comparisons`
Iteration 4: Largest change in params was 0.000188 in the m_probability of last_name_12, level `All other comparisons`
Iteration 5: Largest change in pa

In [28]:
em_session_1.m_u_values_interactive_history_chart()

In [29]:
em_session_1.match_weights_interactive_history_chart()

In [30]:
blocking_rule_for_training = "l.middle_initial = r.middle_initial and l.last_name_12 = r.last_name_12"
em_session_2 = linker.estimate_parameters_using_expectation_maximisation(
    blocking_rule_for_training,
    # Fix lambda; u is fixed by default
    fix_probability_two_random_records_match=True,
)


----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l.middle_initial = r.middle_initial and l.last_name_12 = r.last_name_12

Parameter estimates will be made for the following comparison(s):
    - first_name_15
    - day_of_birth
    - month_of_birth
    - year_of_birth
    - street_number
    - street_name
    - unit_number
    - zipcode

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - last_name_12
    - middle_initial

Iteration 1: Largest change in params was -0.0989 in the m_probability of first_name_15, level `Exact match`
Iteration 2: Largest change in params was -0.00354 in the m_probability of first_name_15, level `Exact match`
Iteration 3: Largest change in params was -0.000774 in the m_probability of year_of_birth, level `Exact match`
Iteration 4: Largest change in params was -0.000232 in the m_probability of year_of_birth, level `Exact match`
Iteration 5: L

In [31]:
em_session_2.m_u_values_interactive_history_chart()

In [32]:
linker.match_weights_chart()

In [33]:
linker.m_u_parameters_chart()

In [34]:
linker.parameter_estimate_comparisons_chart()

In [35]:
splink_settings = linker._settings_obj.as_dict()

## Implement matching passes

In [36]:
from dataclasses import dataclass

# Calculate this once to save time -- mapping from record_id to record_id_raw_input_file
# There can be multiple records (with different record_id) for the same input file record
# (record_id_raw_input_file) because of the handling of nicknames by creating extra records.
record_id_raw_input_file_by_record_id = census_2030.set_index('record_id').record_id_raw_input_file

all_piks = pd.concat([
    name_dob_reference_file[["record_id", "pik"]],
    geobase_reference_file[["record_id", "pik"]]
], ignore_index=True).set_index("record_id").pik

dates_of_death = (
    pd.read_parquet(f'{input_dir}/{data_to_use}/simulated_census_numident.parquet')
        .set_index('pik')
        .date_of_death
        .pipe(lambda s: pd.to_datetime(s, format='%Y%m%d', errors='coerce'))
)

class PersonLinkageCascade:
    def __init__(self):
        # This dataframe will accumulate the PIKs to attach to the input file
        self.confirmed_piks = pd.DataFrame(columns=["record_id_raw_input_file", "pik"])
        self.current_module = None

    def start_module(self, *args, **kwargs):
        assert self.current_module is None or self.current_module.confirmed
        self.current_module = PersonLinkageModule(*args, **kwargs)

    def run_matching_pass(self, *args, **kwargs):
        self.current_module.run_matching_pass(*args, already_confirmed_piks=self.confirmed_piks, **kwargs)

    def confirm_piks(self, *args, **kwargs):
        # Make sure we are not about to confirm PIKs for any input file records we have
        # already PIKed
        assert (
            set(self.current_module.provisional_links.record_id_raw_input_file) &
            set(self.confirmed_piks.record_id_raw_input_file)
        ) == set()

        newly_confirmed_piks = self.current_module.confirm_piks_from_provisional_links()

        self.confirmed_piks = pd.concat([
            self.confirmed_piks,
            newly_confirmed_piks,
        ], ignore_index=True)

        return self.confirmed_piks

@dataclass
class PersonLinkageModule:
    name: str
    reference_file: pd.DataFrame
    reference_file_name: str
    cut_columns: list[str]
    matching_columns: list[str]

    def __post_init__(self):
        self.provisional_links = pd.DataFrame(columns=["record_id_census_2030"])
        self.confirmed = False

    def run_matching_pass(
        self,
        pass_name,
        blocking_columns,
        probability_threshold=0.99,
        input_data_transformation=lambda x: x,
        already_confirmed_piks=pd.DataFrame(columns=["record_id_raw_input_file"]),
    ):
        assert not self.confirmed

        print(f"Running {pass_name} of {self.name}")

        columns_needed = ["record_id"] + self.cut_columns + blocking_columns + self.matching_columns
        tables_for_splink = [
            prep_table_for_splink(self.reference_file, self.reference_file_name, columns_needed),
            prep_table_for_splink(
                census_2030[
                    # Only look for matches among records that have not received a confirmed PIK
                    ~census_2030.record_id_raw_input_file.isin(already_confirmed_piks.record_id_raw_input_file) &
                    # Only look for matches among records that have not received a provisional link
                    # NOTE: "records" here does not mean input file records -- a nickname record having
                    # a provisional link does not prevent a canonical name record from the same input record
                    # from continuing to match
                    ~census_2030.record_id.isin(self.provisional_links.record_id_census_2030)
                ].pipe(input_data_transformation),
                "census_2030",
                columns_needed,
            )
        ]
        print(f"Files to link are {len(tables_for_splink[0]):,.0f} and {len(tables_for_splink[1]):,.0f} records")

        blocking_rule_parts = [f"l.{col} = r.{col}" for col in self.cut_columns + blocking_columns]
        blocking_rule = " and ".join(blocking_rule_parts)

        # We base our Splink linker for this pass on the one we trained above,
        # but limiting it to the relevant column comparisons and updating the pass-specific
        # settings
        pass_splink_settings = copy.deepcopy(splink_settings)
        pass_splink_settings["comparisons"] = [
            c for c in pass_splink_settings["comparisons"] if c["output_column_name"] in self.matching_columns
        ]
        pass_splink_settings["probability_two_random_records_match"] = (
            probability_two_random_records_match(census_2030, self.reference_file)
        )
        pass_splink_settings["blocking_rules_to_generate_predictions"] = [blocking_rule]

        if splink_engine == 'duckdb':
            linker = DuckDBLinker(
                tables_for_splink,
                pass_splink_settings,
                # Must match order of tables_for_splink
                input_table_aliases=["reference_file", "census_2030"]
            )
        else:
            linker = SparkLinker(
                tables_for_splink,
                pass_splink_settings,
                # Must match order of tables_for_splink
                input_table_aliases=["reference_file", "census_2030"],
                spark=spark,
            )
    
        num_comparisons = linker.count_num_comparisons_from_blocking_rule(blocking_rule)
        print(f"Number of pairs that will be compared: {num_comparisons:,.0f}")
    
        # https://moj-analytical-services.github.io/splink/demos/06_Visualising_predictions.html#comparison-viewer-dashboard
        # We also include some pairs below the threshold, for additional context.
        pairs_worth_inspecting = linker.predict(threshold_match_probability=probability_threshold - 0.2)
    
        dashboard_file_name = f"splink_temp/{self.name.replace(' ', '_')}__{pass_name.replace(' ', '_')}.html"
        linker.comparison_viewer_dashboard(pairs_worth_inspecting, dashboard_file_name, overwrite=True)
        from IPython.display import IFrame, display
        display(IFrame(
            src=f"./{dashboard_file_name}", width="100%", height=1200
        ))
    
        new_provisional_links = (
            pairs_worth_inspecting
                .as_pandas_dataframe()
                .pipe(lambda df: df[df.match_probability >= probability_threshold])
        )
        if len(new_provisional_links) > 0:
            new_provisional_links = label_pairs_with_dataset(new_provisional_links)
            new_provisional_links["record_id_raw_input_file"] = (
                new_provisional_links.record_id_census_2030.map(record_id_raw_input_file_by_record_id)
            )
        
            self.provisional_links = pd.concat([
                self.provisional_links,
                new_provisional_links.assign(module_name=self.name, pass_name=pass_name)
            ], ignore_index=True)
    
        still_eligible = (
            (~census_2030.record_id_raw_input_file.isin(already_confirmed_piks.record_id_raw_input_file)) &
            (~census_2030.record_id.isin(self.provisional_links.record_id_census_2030))
        )
        print(f'Matched {len(new_provisional_links)} records; {still_eligible.mean():.2%} still eligible to match')

    def confirm_piks_from_provisional_links(self):
        assert not self.confirmed

        provisional_links = self.provisional_links
        provisional_links["pik"] = provisional_links.record_id_reference_file.map(all_piks)

        # "After the initial set of links is created in GeoSearch, a post-search program is run to determine
        # which of the links are retained. A series of checks are performed: First the date of death
        # information from the Numident is checked and links to deceased persons are dropped. Next a
        # check is made for more than one SSN assigned to a source record. If more than one SSN is
        # assigned, the best link is selected based on match weights. If no best SSN is determined, all SSNs
        # assigned in the GeoSearch module are dropped and the input record cascades to the next
        # module. A similar post-search program is run at the end of all search modules."
        # - Layne et al. p. 5

        # Drop links to deceased people
        # NOTE: Brown et al. (2023, p. 38) discuss at length the number of PVS matches to deceased
        # people, which should not be possible based on this process.
        # Even though this is more recent, I can't think of a reason why this check would have
        # been *removed* from PVS -- can we chalk this up to something experimental they were doing for
        # the AR Census in the 2023 report?
        link_dates_of_death = provisional_links["pik"].map(dates_of_death)
        # Census day 2030
        deceased_links = link_dates_of_death <= pd.to_datetime("2030-04-01")
        print(f'{deceased_links.sum()} input records linked to deceased people, dropping links')
        provisional_links = provisional_links[~deceased_links]

        # Check for multiple linkage to a single input file record
        max_probability = provisional_links.groupby("record_id_raw_input_file").match_probability.max()
        piks_per_input_file = (
            provisional_links.groupby("record_id_raw_input_file")
                .apply(lambda df: df[df.match_probability == max_probability[df.name]].pik.nunique())
        )

        multiple_piks = piks_per_input_file[piks_per_input_file > 1].index
        print(f'{len(multiple_piks)} input records linked to multiple PIKs, dropping links')
        provisional_links = (
            provisional_links[~provisional_links.record_id_raw_input_file.isin(multiple_piks)]
                .sort_values("match_probability")
                .groupby("record_id_raw_input_file")
                .last()
                .reset_index()
        )

        assert (provisional_links.groupby("record_id_raw_input_file").pik.nunique() == 1).all()

        self.confirmed = True
        self.provisional_links = None
        
        return (
            provisional_links[[
                "record_id_raw_input_file",
                "record_id_census_2030",
                "record_id_reference_file",
                "pik",
                "module_name",
                "pass_name",
                "match_probability",
            ]]
        )

def label_pairs_with_dataset(pairs):
    # Name the columns according to the datasets, not "r" (right) and "l" (left)
    for suffix in ["r", "l"]:
        pairs = (
            pairs.groupby(f'source_dataset_{suffix}')
                .apply(lambda df_g: replace_suffix_with_source_dataset(df_g, suffix, df_g.name))
        )

    return pairs

def replace_suffix_with_source_dataset(df, suffix, source_dataset):
    return df.rename(columns=lambda c: re.sub(f'_{suffix}$', f'_{source_dataset}', c))
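
The post-search checks in `confirm_piks_from_provisional_links` can be illustrated on a toy set of links.
This is a simplified sketch in plain Python rather than pandas; the function `resolve_links` and the example
records are hypothetical, and it only mirrors the two checks quoted from Layne et al. above.

```python
from datetime import date


def resolve_links(links, dates_of_death, census_day=date(2030, 4, 1)):
    """links: list of dicts with keys 'input_id', 'pik', 'match_probability'.
    dates_of_death: dict mapping pik -> date of death (absent if alive).
    Returns a dict mapping input_id -> confirmed pik."""
    # 1. Drop links to people who died on or before census day
    alive_links = [
        link for link in links
        if not (link["pik"] in dates_of_death and dates_of_death[link["pik"]] <= census_day)
    ]
    # 2. Group the remaining links by input record
    by_input = {}
    for link in alive_links:
        by_input.setdefault(link["input_id"], []).append(link)
    # 3. Keep the best link, unless several PIKs tie for best
    confirmed = {}
    for input_id, candidates in by_input.items():
        best = max(c["match_probability"] for c in candidates)
        best_piks = {c["pik"] for c in candidates if c["match_probability"] == best}
        if len(best_piks) == 1:
            confirmed[input_id] = best_piks.pop()
    return confirmed


links = [
    {"input_id": "a", "pik": 1, "match_probability": 0.999},
    {"input_id": "a", "pik": 2, "match_probability": 0.995},  # loses to pik 1
    {"input_id": "b", "pik": 3, "match_probability": 0.999},  # pik 3 is deceased
    {"input_id": "c", "pik": 4, "match_probability": 1.0},
    {"input_id": "c", "pik": 5, "match_probability": 1.0},    # tie: no best PIK
]
confirmed = resolve_links(links, dates_of_death={3: date(2029, 12, 31)})
print(confirmed)
```

Only input record `a` keeps a PIK here; `b` and `c` would cascade to the next module.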

In [37]:
person_linkage_cascade = PersonLinkageCascade()

# Verification Module

Wagner and Layne, p. 14:

> If the input file has a SSN data field, it first goes through the verification process.

This module of PVS only runs on input records with an SSN.
No records in the CUF have an SSN (SSN is not collected on the decennial census),
so this module is skipped.

# GeoSearch

Brown et al. p. 29:

> the GeoSearch module blocks records at
different levels of geography, starting with the housing unit level, then broadening geography
up to the three-digit ZIP Code level. The typical matching variables are name, DOB, sex, and
various address fields.

Wagner and Layne p. 14:

> The typical GeoSearch blocking strategy starts with blocking records at the
household level, then broadens the geography for each successive pass and ends at
blocking by the first three digits of the zip code. The typical match variables are first,
middle, and last names; generational suffix; date of birth; gender and various address
fields.
> 
> The data for the GeoSearch module are split into 1,000 cuts based on the first three
digits of the zip code (zip3) for record. The GeoSearch program works on one zip3 cut
at a time, with shell scripts submitting multiple streams of cuts to the system. This
allows for parallel processing and restart capability.

Layne, Wagner, and Rothhaas list an exact series of passes with blocking and matching variables
for each (Appendix A on p. 25).
This was the series of passes they used in running PVS on the Medicare Enrollment
Database, the Indian Health Service Patient Registration File, and on two commercial
data sources.
All of these data sources are rather different from a CUF, and this paper is from 2014,
but it is the only place I know of where an entire set of passes is described,
so I have used the exact same passes here.
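
The cascading structure of the passes can be sketched as follows.
This is a simplification: a plain equality check stands in for the probabilistic comparison,
and the function `run_cascade`, the pass names, and the toy records are all hypothetical.

```python
def run_cascade(input_records, reference_records, passes, is_match):
    """Run blocking passes in order; a record linked in one pass is
    not considered in later passes.

    passes: list of (pass_name, blocking_columns) tuples, strictest first.
    is_match: function(input_record, reference_record) -> bool, standing in
        for the probabilistic comparison.
    Returns a dict mapping input record id -> (pass_name, reference record id).
    """
    links = {}
    for pass_name, blocking_columns in passes:
        # Index the reference file by blocking key
        blocks = {}
        for ref in reference_records:
            key = tuple(ref[c] for c in blocking_columns)
            blocks.setdefault(key, []).append(ref)
        for rec in input_records:
            if rec["id"] in links:
                continue  # already linked in an earlier, stricter pass
            key = tuple(rec[c] for c in blocking_columns)
            # Only pairs that share a blocking key are ever compared
            for ref in blocks.get(key, []):
                if is_match(rec, ref):
                    links[rec["id"]] = (pass_name, ref["id"])
                    break
    return links


reference = [
    {"id": "r1", "zip3": "981", "geokey": "1 main st 98101", "name": "ana diaz"},
    {"id": "r2", "zip3": "981", "geokey": "9 elm st 98103", "name": "bo chen"},
]
inputs = [
    {"id": "i1", "zip3": "981", "geokey": "1 main st 98101", "name": "ana diaz"},
    {"id": "i2", "zip3": "981", "geokey": "5 oak st 98105", "name": "bo chen"},  # moved
]
passes = [("geokey", ["zip3", "geokey"]), ("zip3 only", ["zip3"])]
links = run_cascade(inputs, reference, passes, is_match=lambda a, b: a["name"] == b["name"])
print(links)
```

The first record links in the strict address pass; the second shares no address with the reference file and links only once the blocking is broadened.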

In [38]:
person_linkage_cascade.start_module(
    name="geosearch",
    reference_file=geobase_reference_file,
    reference_file_name="geobase_reference_file",
    cut_columns=["zip3"],
    matching_columns=[
        "first_name_15",
        "last_name_12",
        "middle_initial",
        "day_of_birth",
        "month_of_birth",
        "year_of_birth",
        "street_number",
        "street_name",
        "unit_number",
        "zipcode",
    ],
)

## Pass 1: Block on geokey (entire address)

In [39]:
person_linkage_cascade.run_matching_pass(
    pass_name="geokey",
    blocking_columns=["geokey"],
)

Running geokey of geosearch
Files to link are 32,825 and 11,030 records
Number of pairs that will be compared: 634,824


Matched 8462 records; 27.03% still eligible to match


## Pass 2: Block on geokey (entire address), switching first and last names

In [40]:
def switch_first_and_last_names(df):
    return (
        df.rename(columns={"first_name": "last_name", "last_name": "first_name"})
            # Re-calculate the truncated versions of first and last.
            # NOTE: It is not necessary to re-calculate the phonetic versions, because
            # those are never used in any pass that has a name switch.
            .pipe(add_truncated_name_cols)
    )

In [41]:
# We don't actually have any swapping of first and last names in pseudopeople
person_linkage_cascade.run_matching_pass(
    pass_name="geokey name switch",
    blocking_columns=["geokey"],
    input_data_transformation=switch_first_and_last_names,
)

Running geokey name switch of geosearch
Files to link are 32,825 and 2,981 records
Number of pairs that will be compared: 19,585


Matched 0 records; 27.03% still eligible to match


## Pass 3: Block on house number and street name Soundex

In [42]:
person_linkage_cascade.run_matching_pass(
    pass_name="house number and street name Soundex",
    blocking_columns=["street_number", "street_name_soundex"],
)

Running house number and street name Soundex of geosearch
Files to link are 32,825 and 2,981 records
Number of pairs that will be compared: 116,674


Matched 1106 records; 19.68% still eligible to match


## Pass 4: Block on house number and street name Soundex, switching first and last names

In [43]:
person_linkage_cascade.run_matching_pass(
    pass_name="house number and street name Soundex name switch",
    blocking_columns=["street_number", "street_name_soundex"],
    input_data_transformation=switch_first_and_last_names,
)

Running house number and street name Soundex name switch of geosearch
Files to link are 32,825 and 2,171 records
Number of pairs that will be compared: 25,251


Matched 0 records; 19.68% still eligible to match


## Pass 5: Block on some name and DOB information

In [44]:
person_linkage_cascade.run_matching_pass(
    pass_name="some name and DOB information",
    blocking_columns=["first_name_2", "last_name_2", "year_of_birth"],
)

Running some name and DOB information of geosearch
Files to link are 32,825 and 2,171 records
Number of pairs that will be compared: 2,301


Matched 2162 records; 8.94% still eligible to match


## Post-process and confirm PIKs

In [45]:
person_linkage_cascade.confirm_piks()

137 input records linked to deceased people, dropping links
11 input records linked to multiple PIKs, dropping links


| | record_id_raw_input_file | pik | record_id_census_2030 | record_id_reference_file | module_name | pass_name | match_probability |
|---|---|---|---|---|---|---|---|
| 0 | simulated_census_2030_0 | 89484 | census_2030_preprocessed_0 | simulated_geobase_reference_file_951 | geosearch | geokey | 1.0 |
| 1 | simulated_census_2030_1 | 98736 | census_2030_preprocessed_1 | simulated_geobase_reference_file_17348 | geosearch | geokey | 1.0 |
| 2 | simulated_census_2030_10 | 94481 | census_2030_preprocessed_10 | simulated_geobase_reference_file_9789 | geosearch | geokey | 1.0 |
| 3 | simulated_census_2030_100 | 100835 | census_2030_preprocessed_100 | simulated_geobase_reference_file_21247 | geosearch | some name and DOB information | 1.0 |
| 4 | simulated_census_2030_1000 | 93179 | census_2030_preprocessed_1000 | simulated_geobase_reference_file_7496 | geosearch | geokey | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9933 | simulated_census_2030_9994 | 100273 | census_2030_preprocessed_9994 | simulated_geobase_reference_file_20169 | geosearch | some name and DOB information | 1.0 |
| 9934 | simulated_census_2030_9996 | 95280 | census_2030_preprocessed_9996 | simulated_geobase_reference_file_11208 | geosearch | house number and street name Soundex | 1.0 |
| 9935 | simulated_census_2030_9997 | 101556 | census_2030_preprocessed_9997 | simulated_geobase_reference_file_22666 | geosearch | geokey | 1.0 |
| 9936 | simulated_census_2030_9998 | 98532 | census_2030_preprocessed_9998 | simulated_geobase_reference_file_17011 | geosearch | geokey | 1.0 |


In [46]:
person_linkage_cascade.confirmed_piks.groupby(["module_name", "pass_name"]).size().sort_values(ascending=False)

module_name  pass_name                           
geosearch    geokey                                  7966
             some name and DOB information           1170
             house number and street name Soundex     802
dtype: int64

# ZIP3 Adjacency Search

Layne, Wagner, and Rothhaas (p. 6) refer to ZIP3 adjacency as a module and even provide
a list of two passes within it in Appendix A.

Alexander et al. (p. 6) refer to it the same way, as a module.

However, Wagner and Layne (p. 14) say:

> The GeoSearch module also incorporates the adjacency of neighboring areas with
different zip3 values

which implies this logic is a pass or passes *within* GeoSearch (and due to how cascading works,
this is not just an academic distinction).

There is no mention at all of ZIP3 adjacency in the Brown et al. paper from 2023.

**TODO**: Determine whether this still exists, how it works if so, and implement it.
ZIP3 adjacency doesn't really make sense with the sample data, since it contains only one ZIP code.

# Movers

The Movers module is not mentioned in Brown et al., nor in Wagner and Layne.

In Layne, Wagner, and Rothhaas (p. 7), published in 2014, it is described as "Prototyped only – not implemented in PVS":

> The Movers module is appropriate for input files that combine individuals together into
> households.
> The module seeks multiple members of an input household in one address that
> may have moved together to another address.
> To be eligible for this module the household size must be greater than one.
> This module is being tested and was not used in the analysis for this paper.

A year later, Alexander et al. (p. 6) makes no mention of its being experimental:

> To be eligible for the Movers Module, no member of the household can have a
> PIK and the household must consist of more than one member.
> This module considers persons living at the same address as a unit and
> searches for matching units living together in the reference file (without regard for address).

Essentially the same text is included in Massey et al., published in 2018.

Though it doesn't provide much information, [this document](https://gunnisonconsulting.com/docs/pdf/Record%20Linkage%20Slicksheet_FINAL.pdf)
([archive](https://web.archive.org/web/20230705010237/https://gunnisonconsulting.com/docs/pdf/Record%20Linkage%20Slicksheet_FINAL.pdf))
from the Gunnison consulting group states that they worked with Census and
"pioneered the transition from a record-by-record matching approach to an
approach that better uses groups of records, such as full households, to make more
and better links."
Specifically, they claim that 16% of previously unlinked records were linked using
the household-level approach.

**TODO**: Determine whether this module is still in use and, if so, how it works.
This module seems of particular interest, because it sounds like it is quite different
from other modules.
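
To illustrate what treating a household "as a unit" could mean, here is a speculative sketch: starting from person-level candidate links, it counts how many distinct members of an input household link to the same reference household and keeps only household pairs supported by more than one member. The column names, the toy data, and the threshold are assumptions; the sources quoted above do not describe the actual algorithm.

```python
import pandas as pd

# Toy person-level candidate links (all values are made up).
candidate_links = pd.DataFrame({
    "input_household_id": ["hh_A", "hh_A", "hh_A", "hh_B", "hh_B"],
    "input_record_id": ["in_1", "in_2", "in_3", "in_4", "in_5"],
    "reference_household_id": ["ref_X", "ref_X", "ref_Y", "ref_Z", "ref_W"],
})

# Count distinct input household members linked to each reference household.
members_linked = (
    candidate_links
    .groupby(["input_household_id", "reference_household_id"])
    .input_record_id.nunique()
    .rename("members_linked")
    .reset_index()
)

# Assumption: a household-level match needs support from more than one member.
household_matches = members_linked[members_linked.members_linked > 1]
print(household_matches)
```

In this toy data only `hh_A` and `ref_X` qualify, because two members of `hh_A` link to people living together in `ref_X`.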

# NameSearch

Brown et al., p. 29:

> ... the NameSearch module, which uses only name and DOB fields, comparing all combinations of
alternate names and dates of birth.

Wagner and Layne, p. 15:

> The NameSearch module searches the reference files for records failing the
Verification and GeoSearch Modules. Only name and date of birth data are used in this
search process. NameSearch consists of multiple passes against the Numident Name
Reference file, which contains all possible combinations of alternate names and
alternate dates of birth for each SSN in the Census Numident file, and includes data for
ITINs.
> 
> The typical NameSearch blocking strategy starts with a strict first pass, blocking
records by exact date of birth and parts of names. Successive passes block on parts of
the name and date of birth fields to allow for some name and date of birth variation.
The typical match variables are first, middle and last names, generational suffix, date of
birth, and gender.

As in GeoSearch, the only full listing I could find of passes with blocking and matching variables
was in Layne, Wagner, and Rothhaas.
This is probably somewhat out of date and is not what would be used on a CUF, but I have
copied it exactly here.
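
To make the effect of blocking concrete, here is a minimal sketch that counts the candidate pairs produced by blocking on a set of columns. It is not the linkage code this notebook uses; the toy data and the helper name `count_candidate_pairs` are assumptions made for illustration.

```python
import pandas as pd


def count_candidate_pairs(input_df, reference_df, blocking_columns):
    """Count record pairs that agree exactly on every blocking column."""
    pairs = input_df.merge(
        reference_df, on=blocking_columns, suffixes=("_input", "_reference")
    )
    return len(pairs)


# Toy data; all values are made up.
input_records = pd.DataFrame({
    "record_id": ["in_1", "in_2", "in_3"],
    "first_name_1": ["G", "A", "L"],
    "last_name_1": ["A", "H", "L"],
    "year_of_birth": [1943, 1996, 1958],
})
reference_records = pd.DataFrame({
    "record_id": ["ref_1", "ref_2", "ref_3", "ref_4"],
    "first_name_1": ["G", "G", "A", "L"],
    "last_name_1": ["A", "A", "H", "M"],
    "year_of_birth": [1943, 1950, 1996, 1958],
})

# A strict pass compares few pairs; loosening the blocking compares more.
strict = count_candidate_pairs(
    input_records, reference_records, ["first_name_1", "last_name_1", "year_of_birth"]
)
loose = count_candidate_pairs(input_records, reference_records, ["first_name_1", "last_name_1"])
unblocked = len(input_records) * len(reference_records)
print(strict, loose, unblocked)
```

Each pair that survives blocking is then scored on the matching variables, which is why successive passes loosen the blocking only for records that earlier passes failed to match.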

In [47]:
person_linkage_cascade.start_module(
    name="namesearch",
    reference_file=name_dob_reference_file,
    reference_file_name="name_dob_reference_file",
    cut_columns=["first_initial_cut", "last_initial_cut"],
    matching_columns=[
        "first_name_15",
        "last_name_12",
        "middle_initial",
        "day_of_birth",
        "month_of_birth",
        "year_of_birth",
    ],
)

## Pass 1: Block on DOB and NYSIIS of name

In [48]:
person_linkage_cascade.run_matching_pass(
    pass_name="DOB and NYSIIS of name",
    blocking_columns=["day_of_birth", "month_of_birth", "year_of_birth", "first_name_nysiis", "last_name_nysiis"],
)

Running DOB and NYSIIS of name of namesearch
Files to link are 19,875 and 1,089 records
Number of pairs that will be compared: 256


Matched 256 records; 7.55% still eligible to match


## Pass 2: Block on DOB and initials

In [49]:
person_linkage_cascade.run_matching_pass(
    pass_name="DOB and initials",
    blocking_columns=["day_of_birth", "month_of_birth", "year_of_birth", "first_name_1", "last_name_1"],
)

Running DOB and initials of namesearch
Files to link are 19,875 and 833 records
Number of pairs that will be compared: 94


Matched 84 records; 6.79% still eligible to match


## Pass 3: Block on year of birth and first two characters of name

In [50]:
person_linkage_cascade.run_matching_pass(
    pass_name="year of birth and first two characters of name",
    blocking_columns=["year_of_birth", "first_name_2", "last_name_2"],
)

Running year of birth and first two characters of name of namesearch
Files to link are 19,875 and 749 records
Number of pairs that will be compared: 34


Matched 6 records; 6.74% still eligible to match


## Pass 4: Block on birthday and first two characters of name

In [51]:
person_linkage_cascade.run_matching_pass(
    pass_name="birthday and first two characters of name",
    blocking_columns=["day_of_birth", "month_of_birth", "first_name_2", "last_name_2"],
)

Running birthday and first two characters of name of namesearch
Files to link are 19,875 and 743 records
Number of pairs that will be compared: 39


Matched 31 records; 6.46% still eligible to match


## Post-process and confirm PIKs

In [52]:
person_linkage_cascade.confirm_piks()

69 input records linked to deceased people, dropping links
0 input records linked to multiple PIKs, dropping links


Unnamed: 0,record_id_raw_input_file,pik,record_id_census_2030,record_id_reference_file,module_name,pass_name,match_probability
0,simulated_census_2030_0,89484,census_2030_preprocessed_0,simulated_geobase_reference_file_951,geosearch,geokey,1.000000
1,simulated_census_2030_1,98736,census_2030_preprocessed_1,simulated_geobase_reference_file_17348,geosearch,geokey,1.000000
2,simulated_census_2030_10,94481,census_2030_preprocessed_10,simulated_geobase_reference_file_9789,geosearch,geokey,1.000000
3,simulated_census_2030_100,100835,census_2030_preprocessed_100,simulated_geobase_reference_file_21247,geosearch,some name and DOB information,1.000000
4,simulated_census_2030_1000,93179,census_2030_preprocessed_1000,simulated_geobase_reference_file_7496,geosearch,geokey,1.000000
...,...,...,...,...,...,...,...
10241,simulated_census_2030_9880,107581,census_2030_preprocessed_9880,simulated_name_dob_reference_file_18635,namesearch,DOB and initials,0.999998
10242,simulated_census_2030_9903,92805,census_2030_preprocessed_9903,simulated_name_dob_reference_file_3859,namesearch,DOB and initials,0.996331
10243,simulated_census_2030_991,97990,census_2030_preprocessed_991,simulated_name_dob_reference_file_9044,namesearch,DOB and NYSIIS of name,1.000000
10244,simulated_census_2030_9965,104315,census_2030_preprocessed_9965,simulated_name_dob_reference_file_15369,namesearch,DOB and NYSIIS of name,1.000000
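
The two messages printed by `confirm_piks` suggest two filters: links to deceased people are dropped, and so are links for input records that matched more than one PIK. The sketch below shows one way such filters could be written on toy data. It is not the notebook's actual `confirm_piks` implementation; the column names, the `deceased_piks` set, and the order of the two steps are assumptions.

```python
import pandas as pd

# Toy potential links; all values are made up for illustration.
potential_links = pd.DataFrame({
    "record_id": ["in_1", "in_2", "in_3", "in_3", "in_4"],
    "pik": [101, 102, 103, 104, 105],
})
deceased_piks = {102}  # hypothetical: PIKs flagged as deceased in the reference data

# Step 1: drop links to deceased people.
linked_to_deceased = potential_links.pik.isin(deceased_piks)
links = potential_links[~linked_to_deceased]

# Step 2: drop every link for input records linked to more than one distinct PIK.
piks_per_record = links.groupby("record_id").pik.transform("nunique")
confirmed = links[piks_per_record == 1]
print(confirmed)
```

Here `in_2` loses its link because PIK 102 is flagged as deceased, and `in_3` loses both links because they point to different PIKs.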


In [53]:
person_linkage_cascade.confirmed_piks.groupby(["module_name", "pass_name"]).size().sort_values(ascending=False)

module_name  pass_name                                     
geosearch    geokey                                            7966
             some name and DOB information                     1170
             house number and street name Soundex               802
namesearch   DOB and NYSIIS of name                             205
             DOB and initials                                    68
             birthday and first two characters of name           31
             year of birth and first two characters of name       4
dtype: int64
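
The NameSearch pass logs above can be cross-checked with a little arithmetic. The sketch below rests on two assumptions that are inferred from the outputs rather than confirmed from the implementation: that there are 11,028 input records (judging by the record IDs in this notebook's outputs), and that records whose links are dropped by `confirm_piks` return to the eligible pool.

```python
# Assumption: 11,028 input records, inferred from the record IDs in this notebook's outputs.
TOTAL_INPUT_RECORDS = 11_028

# (eligible records at the start of the pass, records matched), from the NameSearch logs.
namesearch_passes = [(1_089, 256), (833, 84), (749, 6), (743, 31)]

remaining_after_pass = [eligible - matched for eligible, matched in namesearch_passes]
for remaining in remaining_after_pass:
    share = remaining / TOTAL_INPUT_RECORDS
    print(f"{remaining:,} remaining; {share:.2%} still eligible to match")

# Assumption: records whose links to deceased people were dropped become eligible again.
dropped_deceased_links = 69
eligible_for_next_module = remaining_after_pass[-1] + dropped_deceased_links
print(eligible_for_next_module)
```

The remaining counts reproduce the eligible counts and percentages in the logs, and the final figure of 781 agrees with the number of eligible records reported at the start of DOBSearch.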

# DOBSearch

Brown et al., p. 29:

> ... the DOBSearch module, which blocks on
month and day of birth, then compares name, sex, and DOB.

Wagner and Layne, p. 15:

> The DOBSearch module searches the reference files for the records that fail the
> NameSearch, using name and date of birth data. The module matches against a re-split
> version of the Numident Name Reference file, splitting the data based on month and day
> of birth.
> 
> There are typically four blocking passes in the DOBSearch module. The first pass
> blocks records by first name in the incoming file to last name in the DOB Reference file
> and last name in incoming file to first name in the DOB Reference file. This strategy
> accounts for switching of first and last name in the incoming file.

Again, I have used the exact passes listed for this module in Layne, Wagner, and Rothhaas.
The claim that these passes are "typical" is somewhat corroborated by their consistency with the
Wagner and Layne description above, but both papers are old.
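
The name-switching strategy in the first pass can be illustrated with a small sketch. The `switch_first_and_last_names` function used in this notebook is defined elsewhere and may differ; the helper below, `swap_first_and_last_name_columns`, is a hypothetical stand-in that exchanges each `first_name*` column with the `last_name*` column that has the same suffix.

```python
import pandas as pd


def swap_first_and_last_name_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with first_name*/last_name* column pairs exchanged."""
    renames = {}
    for column in df.columns:
        if column.startswith("first_name"):
            counterpart = "last_name" + column[len("first_name"):]
            # Only swap columns that have a counterpart with the same suffix.
            if counterpart in df.columns:
                renames[column] = counterpart
                renames[counterpart] = column
    # Rename, then restore the original column order.
    return df.rename(columns=renames)[df.columns]


# Toy data: the input record has its first and last initials switched.
input_records = pd.DataFrame(
    {"record_id": ["in_1"], "first_name_1": ["A"], "last_name_1": ["G"]}
)
reference_records = pd.DataFrame(
    {"record_id": ["ref_1"], "first_name_1": ["G"], "last_name_1": ["A"]}
)
blocking_columns = ["first_name_1", "last_name_1"]

pairs_without_swap = input_records.merge(reference_records, on=blocking_columns)
pairs_with_swap = swap_first_and_last_name_columns(input_records).merge(
    reference_records, on=blocking_columns
)
print(len(pairs_without_swap), len(pairs_with_swap))
```

Without the swap the two records never fall in the same block; with it they become a candidate pair.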

In [54]:
person_linkage_cascade.start_module(
    name="dobsearch",
    reference_file=name_dob_reference_file,
    reference_file_name="name_dob_reference_file",
    cut_columns=["day_of_birth", "month_of_birth"],
    matching_columns=[
        "first_name_15",
        "last_name_12",
        "middle_initial",
        "day_of_birth",
        "month_of_birth",
        "year_of_birth",
    ],
)

## Pass 1: Block on initials, switching first and last names

In [55]:
person_linkage_cascade.run_matching_pass(
    pass_name="initials name switch",
    blocking_columns=["first_name_1", "last_name_1"],
    input_data_transformation=switch_first_and_last_names,
)

Running initials name switch of dobsearch
Files to link are 19,875 and 781 records
Number of pairs that will be compared: 95


Matched 0 records; 7.08% still eligible to match


## Pass 2: Block on first three characters of name

In [56]:
person_linkage_cascade.run_matching_pass(
    pass_name="first three characters of name",
    blocking_columns=["first_name_3", "last_name_3"],
)

Running first three characters of name of dobsearch
Files to link are 19,875 and 781 records
Number of pairs that will be compared: 61


Matched 59 records; 6.55% still eligible to match


## Pass 3: Block on reverse Soundex of name

In [57]:
person_linkage_cascade.run_matching_pass(
    pass_name="reverse Soundex of name",
    blocking_columns=["first_name_reverse_soundex", "last_name_reverse_soundex"],
)

Running reverse Soundex of name of dobsearch
Files to link are 19,875 and 722 records
Number of pairs that will be compared: 33


Matched 31 records; 6.26% still eligible to match
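
Reverse Soundex is conventionally the Soundex code of the name spelled backwards, so the blocking key depends on the end of the name rather than its start. The sketch below is a plain-Python implementation of standard American Soundex, written for illustration. The `first_name_reverse_soundex` and `last_name_reverse_soundex` columns in this notebook are computed elsewhere and may use a different library or treat edge cases differently.

```python
SOUNDEX_CODES = {
    letter: digit
    for letters, digit in [
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    ]
    for letter in letters
}


def soundex(name: str) -> str:
    """Standard American Soundex: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous_code = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(letter, "")  # vowels (and Y) have no code
        if code and code != previous_code:
            result += code
        previous_code = code
    return (result + "000")[:4]


def reverse_soundex(name: str) -> str:
    """Soundex of the name spelled backwards."""
    return soundex(name[::-1])


# A variation in the first letter changes Soundex but not reverse Soundex.
print(soundex("Christopher"), soundex("Khristopher"))
print(reverse_soundex("Christopher"), reverse_soundex("Khristopher"))
```

Blocking on the reversed code therefore gives a second chance to records whose names differ near the beginning, which ordinary Soundex blocking would separate.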


## Pass 4: Block on first two characters of first name and year of birth

In [58]:
person_linkage_cascade.run_matching_pass(
    pass_name="first two characters of first name and year of birth",
    blocking_columns=["first_name_2", "year_of_birth"],
)

Running first two characters of first name and year of birth of dobsearch
Files to link are 19,875 and 691 records
Number of pairs that will be compared: 97


Matched 82 records; 5.52% still eligible to match


## Post-process and confirm PIKs

In [59]:
person_linkage_cascade.confirm_piks()

74 input records linked to deceased people, dropping links
0 input records linked to multiple PIKs, dropping links


Unnamed: 0,record_id_raw_input_file,pik,record_id_census_2030,record_id_reference_file,module_name,pass_name,match_probability
0,simulated_census_2030_0,89484,census_2030_preprocessed_0,simulated_geobase_reference_file_951,geosearch,geokey,1.000000
1,simulated_census_2030_1,98736,census_2030_preprocessed_1,simulated_geobase_reference_file_17348,geosearch,geokey,1.000000
2,simulated_census_2030_10,94481,census_2030_preprocessed_10,simulated_geobase_reference_file_9789,geosearch,geokey,1.000000
3,simulated_census_2030_100,100835,census_2030_preprocessed_100,simulated_geobase_reference_file_21247,geosearch,some name and DOB information,1.000000
4,simulated_census_2030_1000,93179,census_2030_preprocessed_1000,simulated_geobase_reference_file_7496,geosearch,geokey,1.000000
...,...,...,...,...,...,...,...
10338,simulated_census_2030_9668,97984,census_2030_preprocessed_9668,simulated_name_dob_reference_file_9038,dobsearch,first two characters of first name and year of...,0.999997
10339,simulated_census_2030_982,106187,census_2030_preprocessed_982,simulated_name_dob_reference_file_17241,dobsearch,first two characters of first name and year of...,0.997245
10340,simulated_census_2030_9822,104578,census_2030_preprocessed_9822,simulated_name_dob_reference_file_15632,dobsearch,first two characters of first name and year of...,0.996331
10341,simulated_census_2030_9865,101950,census_2030_preprocessed_9865,simulated_name_dob_reference_file_13004,dobsearch,first two characters of first name and year of...,0.999834


In [60]:
person_linkage_cascade.confirmed_piks.groupby(["module_name", "pass_name"]).size().sort_values(ascending=False)

module_name  pass_name                                           
geosearch    geokey                                                  7966
             some name and DOB information                           1170
             house number and street name Soundex                     802
namesearch   DOB and NYSIIS of name                                   205
dobsearch    first two characters of first name and year of birth      72
namesearch   DOB and initials                                          68
             birthday and first two characters of name                 31
dobsearch    reverse Soundex of name                                   25
namesearch   year of birth and first two characters of name             4
dtype: int64

# HHCompSearch

Brown et al., p. 29:

> ... the Household Composition search module, which requires at least one
> person in the household of the unmatched person to have received a PIK.
> The full set of unmatched records with historical name, DOB, sex, and address data from
> households whose members with PIKs were observed in the past is created. The module
> attempts to find a match to this universe based on name and DOB information.

Wagner and Layne, p. 16:

> For persons with a PIK in the eligible household, all of the geokeys from the PVS
> GeoBase are extracted for each of these PIKs. The geokeys are unduplicated and all
> persons are selected from the PVS GeoBase with these geokeys. Next, the program
> removes all household members with a PIK, leaving the unPIKed persons in the
> household. This becomes the reference file to search against. There are typically two
> passes in this module. Records are blocked by MAFID, name, date of birth, and gender.

I don't know of any list of passes in this module.
Blocking on "MAFID, name, date of birth, and gender" seems like a lot for only
two passes, especially given that we are only matching people within each eligible
household.
I'm actually a bit surprised there is any blocking at all, given how restrictive this
module inherently is.
I've split the difference here by creating two passes that are very permissive.

## Create the reference file

In [61]:
# TODO: As of now in pseudopeople, our only indicator in the Census data of household
# is the geokey itself. This can be messed up by noise, so we should switch to using
# a (presumably low-noise) household indicator when we have that.
household_id_approximation = (
    census_2030[["geokey"]].drop_duplicates().reset_index()
        .rename(columns={"index": "household_id"})
        .set_index("geokey").household_id
)
census_2030["household_id"] = census_2030.geokey.map(household_id_approximation)

piks_with_household = (
    census_2030[["household_id", "record_id_raw_input_file"]]
        .merge(person_linkage_cascade.confirmed_piks, on="record_id_raw_input_file", how="left")
)
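# Note: nunique() ignores nulls, so it cannot be used to detect unPIKed records.
# Using any() on boolean flags also gives both Series the same household_id index.
has_pik = piks_with_household.pik.notnull()
someone_piked = has_pik.groupby(piks_with_household.household_id).any()
someone_unpiked = (~has_pik).groupby(piks_with_household.household_id).any()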

In [62]:
eligible_households = someone_piked & someone_unpiked

In [63]:
piks_by_household = piks_with_household[["household_id", "pik"]].dropna(subset="pik").drop_duplicates()
piks_by_household

Unnamed: 0,household_id,pik
0,0,89484
1,1,98736
2,1,91258
3,3,90622
4,4,96379
...,...,...
11020,11023,105205
11021,11024,95963
11023,11026,104164
11024,11027,106182


In [64]:
geokeys_by_household = (
    piks_by_household
        .merge(geobase_reference_file[["pik", "geokey"]].dropna(subset="geokey"), on="pik")
        .drop(columns=["pik"])
        .drop_duplicates()
)
geokeys_by_household

Unnamed: 0,household_id,geokey
0,0,1130 MALLORY LN ANYTOWN WA 00000
1,1,32597 DELACORTE DR ANYTOWN WA 00000
2,1,32597 DELACORTE DR ANYTOWN WA 0000O
4,3,4458 WINDSOR PL ANYTOWN CT 00000
5,3,4458 WINDSOR PL ANYTOWN WA 00000
...,...,...
16787,11024,211 QUIET WAY ANYTOWN WA 00000
16788,11026,24113 LAUDER ANYTOWN WA 00000
16789,11027,1534 BENTLEY DR ANYTOWN WA 00080
16790,11027,1534 BENTLEY DR ANYTOVVN WA 00000


In [65]:
records_to_search_by_household = geokeys_by_household.merge(geobase_reference_file, on="geokey")
records_to_search_by_household

Unnamed: 0,household_id,geokey,record_id,ssn,first_name,middle_name,last_name,street_number,street_name,unit_number,...,last_name_12,middle_initial,first_name_1,last_name_1,first_name_2,last_name_2,first_name_3,last_name_3,street_name_soundex,zip3
0,0,1130 MALLORY LN ANYTOWN WA 00000,simulated_geobase_reference_file_951,426-70-9610,Grrald,Robert,Allen,1130,MALLORY LN,,...,Allen,R,G,A,Gr,Al,Grr,All,M464,000
1,0,1130 MALLORY LN ANYTOWN WA 00000,simulated_geobase_reference_file_17151,021-93-1396,Briana,Jessica,Rodriguez,1130,MALLORY LN,,...,Rodriguez,J,B,R,Br,Ro,Bri,Rod,M464,000
2,0,1130 MALLORY LN ANYTOWN WA 00000,simulated_geobase_reference_file_27855,649-45-5652,Levi,Holden,Rodriguez,1130,MALLORY LN,,...,Rodriguez,H,L,R,Le,Ro,Lev,Rod,M464,000
3,5,1130 MALLORY LN ANYTOWN WA 00000,simulated_geobase_reference_file_951,426-70-9610,Grrald,Robert,Allen,1130,MALLORY LN,,...,Allen,R,G,A,Gr,Al,Grr,All,M464,000
4,5,1130 MALLORY LN ANYTOWN WA 00000,simulated_geobase_reference_file_17151,021-93-1396,Briana,Jessica,Rodriguez,1130,MALLORY LN,,...,Rodriguez,J,B,R,Br,Ro,Bri,Rod,M464,000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210984,11026,24113 LAUDER ANYTOWN WA 00000,simulated_geobase_reference_file_26297,646-22-4854,Ember,Hayden,Samuels,24113,LAUDER,,...,Samuels,H,E,S,Em,Sa,Emb,Sam,L360,000
210985,11027,1534 BENTLEY DR ANYTOWN WA 00080,simulated_geobase_reference_file_29077,401-82-0897,Athena,Valery,Deshpande,1534,BENTLEY DR,,...,Deshpande,V,A,D,At,De,Ath,Des,B534,000
210986,11027,1534 BENTLEY DR ANYTOVVN WA 00000,simulated_geobase_reference_file_29078,401-82-0897,Athena,Valery,Deshpande,1534,BENTLEY DR,,...,Deshpande,V,A,D,At,De,Ath,Des,B534,000
210987,11027,1534 BENTLEY DR ANYTOWN WA 00000,simulated_geobase_reference_file_29079,401-82-0897,Athena,Valery,Deshpande,1534,BENTLEY DR,,...,Deshpande,V,A,D,At,De,Ath,Des,B534,000


In [66]:
# Apparently, we exclude from the reference file all *reference file* records with a PIK that
# has already been assigned to an input file row.
# Doing this goes against the normal assumption, which is that reference file records can match
# to multiple input file records.
# This is really surprising to me, but it seems clear from "the program
# removes all household members with a PIK, leaving the unPIKed persons in the
# household. This becomes the reference file to search against." (Wagner and Layne, p. 16)
hhcomp_reference_file = records_to_search_by_household[~records_to_search_by_household.pik.isin(person_linkage_cascade.confirmed_piks.pik)]
hhcomp_reference_file

Unnamed: 0,household_id,geokey,record_id,ssn,first_name,middle_name,last_name,street_number,street_name,unit_number,...,last_name_12,middle_initial,first_name_1,last_name_1,first_name_2,last_name_2,first_name_3,last_name_3,street_name_soundex,zip3
29,1,1307 ROSEWOOD AVE ANYTOWN WA 00000,simulated_geobase_reference_file_11427,442-50-8185,Micnelle,Mary,Jozsa,1307,ROSEWOOD AVE,,...,Jozsa,M,M,J,Mi,Jo,Mic,Joz,R231,000
30,1,1307 ROSEWOOD AVE ANYTOWN WA 00000,simulated_geobase_reference_file_24654,819-28-2549,Analise,Scarlet,Jozsa,1307,ROSEWOOD AVE,,...,Jozsa,S,A,J,An,Jo,Ana,Joz,R231,000
31,1,1307 ROSEWOOD AVE ANYTOWN WA 00000,simulated_geobase_reference_file_25616,430-18-9346,Leo,Henry,Jozsa,1307,ROSEWOOD AVE,,...,Jozsa,H,L,J,Le,Jo,Leo,Joz,R231,000
32,1,1307 ROSEWOOD AVE ANYTOWN WA 00000,simulated_geobase_reference_file_25757,149-18-3448,Everly,Sahara,Jozsa,1307,ROSEWOOD AVE,,...,Jozsa,S,E,J,Ev,Jo,Eve,Joz,R231,000
38,4,3613 GRAND AVE ANYTOWN WA 00000,simulated_geobase_reference_file_9393,678-53-1652,Jeanette,Jennifer,Fisher,3613,GRAND AVE,,...,Fisher,J,J,F,Je,Fi,Jea,Fis,G653,000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210912,10985,3016 W WISCONSIN AV ANYTOWN WA 00000,simulated_geobase_reference_file_28961,298-48-2311,Sophia,Jaylah,Guillermo Rodriguez,3016,W WISCONSIN AV,,...,Guillermo Ro,J,S,G,So,Gu,Sop,Gui,W252,000
210940,11002,110-11 MAUDS HUGHES RD UNIT 1 ANYTOWN WA 00000,simulated_geobase_reference_file_6829,249-59-8645,Andrea,Paula,Hudson,110-11,MAUDS HUGHES RD,UNIT 1,...,Hudson,P,A,H,An,Hu,And,Hud,M322,000
210943,4,2559 NE 81ST TER # 8 FL ANYTOWN WA 00000,simulated_geobase_reference_file_15537,405-76-2432,Michael,Matthew,Nott,2559,NE 81ST TER,# 8 FL,...,Nott,M,M,N,Mi,No,Mic,Not,N233,000
210970,11013,3119 SOUTH 10TH AVENUE ANYTOWN WA 00000,simulated_geobase_reference_file_31170,011-84-1135,Amber,Katina,Clements,3119,SOUTH 10TH AVENUE,,...,Clements,K,A,C,Am,Cl,Amb,Cle,S331,000


In [67]:
person_linkage_cascade.start_module(
    name="hhcompsearch",
    reference_file=hhcomp_reference_file,
    reference_file_name="hhcomp_reference_file",
    cut_columns=["household_id"],
    matching_columns=[
        "first_name_15",
        "last_name_12",
        "middle_initial",
        "day_of_birth",
        "month_of_birth",
        "year_of_birth",
    ],
)

## Pass 1: Block on initials

In [68]:
person_linkage_cascade.run_matching_pass(
    pass_name="initials",
    blocking_columns=["first_name_1", "last_name_1"],
)

Running initials of hhcompsearch
Files to link are 56,686 and 683 records
Number of pairs that will be compared: 738


Matched 38 records; 5.88% still eligible to match


## Pass 2: Block on year of birth

In [69]:
person_linkage_cascade.run_matching_pass(
    pass_name="year of birth",
    blocking_columns=["year_of_birth"],
)

Running year of birth of hhcompsearch
Files to link are 56,686 and 649 records
Number of pairs that will be compared: 1,025


Matched 7 records; 5.82% still eligible to match


## Post-process and confirm PIKs

In [70]:
person_linkage_cascade.confirm_piks()

39 input records linked to deceased people, dropping links
0 input records linked to multiple PIKs, dropping links


Unnamed: 0,record_id_raw_input_file,pik,record_id_census_2030,record_id_reference_file,module_name,pass_name,match_probability
0,simulated_census_2030_0,89484,census_2030_preprocessed_0,simulated_geobase_reference_file_951,geosearch,geokey,1.000000
1,simulated_census_2030_1,98736,census_2030_preprocessed_1,simulated_geobase_reference_file_17348,geosearch,geokey,1.000000
2,simulated_census_2030_10,94481,census_2030_preprocessed_10,simulated_geobase_reference_file_9789,geosearch,geokey,1.000000
3,simulated_census_2030_100,100835,census_2030_preprocessed_100,simulated_geobase_reference_file_21247,geosearch,some name and DOB information,1.000000
4,simulated_census_2030_1000,93179,census_2030_preprocessed_1000,simulated_geobase_reference_file_7496,geosearch,geokey,1.000000
...,...,...,...,...,...,...,...
10344,simulated_census_2030_2161,89295,census_2030_preprocessed_2161,simulated_geobase_reference_file_608,hhcompsearch,year of birth,0.995957
10345,simulated_census_2030_3395,106196,census_2030_preprocessed_3395,simulated_geobase_reference_file_29096,hhcompsearch,year of birth,0.995957
10346,simulated_census_2030_5275,105747,census_2030_preprocessed_5275,simulated_geobase_reference_file_28463,hhcompsearch,year of birth,0.995957
10347,simulated_census_2030_6921,103290,census_2030_preprocessed_6921,simulated_geobase_reference_file_25197,hhcompsearch,year of birth,0.995957


In [71]:
person_linkage_cascade.confirmed_piks.groupby(["module_name", "pass_name"]).size().sort_values(ascending=False)

module_name   pass_name                                           
geosearch     geokey                                                  7966
              some name and DOB information                           1170
              house number and street name Soundex                     802
namesearch    DOB and NYSIIS of name                                   205
dobsearch     first two characters of first name and year of birth      72
namesearch    DOB and initials                                          68
              birthday and first two characters of name                 31
dobsearch     reverse Soundex of name                                   25
hhcompsearch  year of birth                                              6
namesearch    year of birth and first two characters of name             4
dtype: int64

In [72]:
person_linkage_cascade.confirmed_piks

Unnamed: 0,record_id_raw_input_file,pik,record_id_census_2030,record_id_reference_file,module_name,pass_name,match_probability
0,simulated_census_2030_0,89484,census_2030_preprocessed_0,simulated_geobase_reference_file_951,geosearch,geokey,1.000000
1,simulated_census_2030_1,98736,census_2030_preprocessed_1,simulated_geobase_reference_file_17348,geosearch,geokey,1.000000
2,simulated_census_2030_10,94481,census_2030_preprocessed_10,simulated_geobase_reference_file_9789,geosearch,geokey,1.000000
3,simulated_census_2030_100,100835,census_2030_preprocessed_100,simulated_geobase_reference_file_21247,geosearch,some name and DOB information,1.000000
4,simulated_census_2030_1000,93179,census_2030_preprocessed_1000,simulated_geobase_reference_file_7496,geosearch,geokey,1.000000
...,...,...,...,...,...,...,...
10344,simulated_census_2030_2161,89295,census_2030_preprocessed_2161,simulated_geobase_reference_file_608,hhcompsearch,year of birth,0.995957
10345,simulated_census_2030_3395,106196,census_2030_preprocessed_3395,simulated_geobase_reference_file_29096,hhcompsearch,year of birth,0.995957
10346,simulated_census_2030_5275,105747,census_2030_preprocessed_5275,simulated_geobase_reference_file_28463,hhcompsearch,year of birth,0.995957
10347,simulated_census_2030_6921,103290,census_2030_preprocessed_6921,simulated_geobase_reference_file_25197,hhcompsearch,year of birth,0.995957


# Resulting PIKs

In [73]:
pik_values = (
    person_linkage_cascade.confirmed_piks
        .rename(columns={"record_id_raw_input_file": "record_id"})[["record_id", "pik"]]
        .drop_duplicates()
)

In [74]:
census_2030_piked = census_2030_raw_input.copy()
census_2030_piked = census_2030_piked.merge(
    pik_values,
    how="left",
    on="record_id",
    validate="1:1",
)
census_2030_piked

Unnamed: 0,record_id,household_id,first_name,middle_initial,last_name,age,date_of_birth,street_number,street_name,unit_number,city,state,zipcode,housing_type,relationship_to_reference_person,sex,race_ethnicity,year,pik
0,simulated_census_2030_0,0_8033,Gerald,R,Allen,86,11/03/1943,1130,mallory ln,,Anytown,WA,00000,Household,Reference person,Male,Black,2030,89484
1,simulated_census_2030_1,0_1066,April,S,Hayden,33,10/23/1996,32597,delacorte dr,,Anytown,WA,00000,Household,Other nonrelative,Female,Black,2030,98736
2,simulated_census_2030_2,0_1066,Loretta,T,Lowe,71,06/01/1958,32597,delacorte dr,,Anytown,WA,00000,Household,Reference person,Female,White,2030,91258
3,simulated_census_2030_3,0_2514,Sandra,A,Sorrentino,75,03/18/1954,4458,wibdsor pl,,Anytown,WA,00000,Household,Reference person,Female,Multiracial or Other,2030,90622
4,simulated_census_2030_4,0_5627,Bobby,S,Baker,44,05/20/1985,,winding trail rd,,Anytown,WA,00000,Household,Other nonrelative,Male,White,2030,96379
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11024,simulated_census_2030_11024,0_10778,Jeremy,T,Boyd,46,07/01/1983,211,quiet wsy,,Anytown,WA,00000,Household,Reference person,Male,Black,2030,95963
11025,simulated_census_2030_11025,0_11001,Wendy,M,Gross,54,12/05/1975,2801,blje rdv dr n,,Anytown,WA,00000,Household,Reference person,Female,White,2030,
11026,simulated_census_2030_11026,0_5308,Ember,H,Samuels,10,10/26/2019,24113,lauder,,Anytown,WA,00000,Household,Reference person,Female,Black,2030,104164
11027,simulated_census_2030_11027,0_10693,Athena,V,Deshpande,27,07/05/2002,1534,bentley dr,,Anytown,WA,00000,Household,Reference person,Female,Asian,2030,106182


In [75]:
piked_proportion = census_2030_piked.pik.notnull().mean()
# Compare with 90.28% of input records PIKed in the 2010 CUF,
# as reported in Wagner and Layne, Table 2, p. 18 
print(f'{piked_proportion:.2%} of the input records were PIKed')

93.83% of the input records were PIKed


In [76]:
census_2030_piked.to_parquet(f'{output_dir}/{data_to_use}/census_2030_piked.parquet')

In [77]:
person_linkage_cascade.confirmed_piks.to_parquet(f'{output_dir}/{data_to_use}/confirmed_piks.parquet')