# **Person linkage case study with small-scale sample data**

In this case study, we imagine running a person linkage to add unique identifiers to a simulated 2030 Census Unedited File (CUF).
We emulate the methods of the Census Bureau's Person Identification Validation System (PVS) using publicly available descriptions.
The PVS aims to give people in the input file (the CUF, in this case) unique identifiers that can be used to link them to other administrative records.

We approximate the (highly confidential) input data to PVS by using simulated data from our pseudopeople package.
See the `generate_simulated_data` notebook for details about this.

## Sources

These papers describe PVS as their main subject:

* Wagner and Layne. The Person Identification Validation System (PVS): Applying the Center for Administrative Records Research and Applications’ (CARRA) Record Linkage Software. 2014. https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-01.pdf ([archived](https://web.archive.org/web/20230216043235/https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-01.pdf))
* Layne, Wagner, and Rothhaas. Estimating Record Linkage False Match Rate for the Person Identification Validation System. 2014. https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-02.pdf ([archived](https://web.archive.org/web/20220121051156/https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-02.pdf))
* Alexander et al. Creating a Longitudinal Data Infrastructure at the Census Bureau. 2015. https://www.census.gov/content/dam/Census/library/working-papers/2015/adrm/2015-alexander.pdf ([archived](https://web.archive.org/web/20170723192315/http://www.census.gov/content/dam/Census/library/working-papers/2015/adrm/2015-alexander.pdf))
* Massey and O'Hara. Person Matching in Historical Files using the Census Bureau’s Person Validation System. 2014. https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-11.pdf ([archived](https://web.archive.org/web/20221018074814/http://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-11.pdf))
* NORC. Assessment of the U.S. Census Bureau’s Person Identification Validation System. 2011. https://www.norc.org/content/dam/norc-org/pdfs/PVS%20Assessment%20Report%20FINAL%20JULY%202011.pdf ([archived](https://web.archive.org/web/20230705005935/https://www.norc.org/content/dam/norc-org/pdfs/PVS%20Assessment%20Report%20FINAL%20JULY%202011.pdf))

These apply PVS to some linking task, and in doing so describe it in some detail:

* Brown et al. Real-Time 2020 Administrative Record Census Simulation. 2023. https://www2.census.gov/programs-surveys/decennial/2020/program-management/evaluate-docs/EAE-2020-admin-records-experiment.pdf ([archived](https://web.archive.org/web/20230521191811/https://www2.census.gov/programs-surveys/decennial/2020/program-management/evaluate-docs/EAE-2020-admin-records-experiment.pdf))
* Massey et al. Linking the 1940 U.S. Census with Modern Data. 2018. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6530596/ ([DOI](https://doi.org/10.1080%2F01615440.2018.1507772))
* Luque and Wagner. Assessing Coverage and Quality of the 2007 Prototype Census Kidlink Database. 2015. https://www.census.gov/content/dam/Census/library/working-papers/2015/adrm/carra-wp-2015-07.pdf ([archived](https://web.archive.org/web/20220808205231/http://www.census.gov/content/dam/Census/library/working-papers/2015/adrm/carra-wp-2015-07.pdf))
* Bond et al. The Nature of the Bias When Studying Only Linkable Person Records: Evidence from the American Community Survey. 2014. https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-08.pdf ([archived](https://web.archive.org/web/20220803083857/http://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-08.pdf))

Finally, this paper is about the address matching process (MAF Match) which acts as part of PVS:

* Brummet. Comparison of Survey, Federal, and Commercial Address Data Quality. 2014. https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-06.pdf ([archived](https://web.archive.org/web/20220121213358/https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-06.pdf))

All of these papers are referenced throughout this notebook (and other notebooks) by the authors' names, along with additional information like page numbers.
All page numbers represent those of the PDF files themselves, *not* page numbers printed in the documents.

## PVS overview

Brown et al., p. 28:

> The PVS uses probabilistic record linkage (Fellegi and Sunter, 1969) to match data from an
    incoming file (e.g., a survey or administrative record file) to reference files containing data on
    SSN applications from the NUMIDENT enhanced with address data obtained from other federal
    administrative records.

## Setup

In [1]:
import re, copy, os
import pandas as pd, numpy as np
import jellyfish

In [2]:
data_to_use = 'small_sample'
input_dir = 'generate_simulated_data/output'
output_dir = 'output'

# Load data

See the code in the `generate_simulated_data` directory for how we generated the files to link.

In [3]:
census_2030 = pd.read_parquet(f'{input_dir}/{data_to_use}/simulated_census_2030.parquet')
geobase_reference_file = pd.read_parquet(f'{input_dir}/{data_to_use}/simulated_geobase_reference_file.parquet')
name_dob_reference_file = pd.read_parquet(f'{input_dir}/{data_to_use}/simulated_name_dob_reference_file.parquet')

# Pre-process input file

Wagner and Layne, p. 9:

> The first step of the PVS process is to edit data fields to make them homogenous for
comparisons between incoming and reference files.

In [4]:
# Input file before any processing; the final result will be this with the PIK column added
census_2030_raw_input = census_2030.copy()

## Name parsing and standardizing

Wagner and Layne, p. 9:

> The first edits are parsing and standardizing - parsing separates fields into
component parts, while standardizing guarantees key data elements are consistent (e.g.,
STREET, STR are both converted to ST). Name and address fields are parsed and
standardized as they are key linkage comparators. 

As noted in the data generation notebook, there is no parsing to be done here:
because the Census questionnaire asks for first name, middle initial, and last name as
separate fields on the form, the name would be already parsed when running PVS on the CUF.

The real PVS name parser generates fields like prefix (e.g. Mr.) and suffix (e.g. Jr.)
but we never have these in names from pseudopeople.

More details about the name parser and standardizer (Wagner and Layne, p. 10):

> The PVS system incorporates a name
standardizer (McGaughey, 1994), which is a C language subroutine called as a function
within SAS. It performs name parsing and includes a nickname lookup table and outputs
name variants (standardized variations of first and last names). For example, Bill
becomes William, Chuck and Charlie becomes Charles, etc. The PVS keeps both the
original name (Bill) and the converted name (William) for matching. PVS also has a fake
name table to blank names such as “Queen of the House” or “Baby Girl.” The name data
are parsed, checked for nicknames, and standardized.

The key thing that wasn't clear to me from this description was what it meant to
"keep both original name and the converted name for matching."
I found this in Massey et al.:

> The Name Search Module also accounts for instances where census records contain a nickname.
> For these records, the preprocessing step of the Name Search module outputs two records for
> these observations, one record for the nickname and one record for the formal name.
> For example, if the input record has the name “Bill Smith,” the formatting program will add
> a formal name “William” to that record. This record will then output to both the B-S cut and to the W-S cut.

From this, I gather that nicknames work similarly to alternate names in the reference file:
entire duplicate records are made with each name (nickname and formal).
However, I am still confused by calling this "the preprocessing step **of the Name Search module**."
In Wagner and Layne it does not seem like this is module-specific, and I don't see any
reason that would be desirable, so I have done this multi-record approach across all modules.
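
To make the multi-record idea concrete, here is a minimal plain-Python sketch of how I read the Massey et al. example. The helper names `expand_nicknames` and `name_cut` are hypothetical, the nickname table is a two-entry illustration, and the "cut" is simplified to the raw first and last initials rather than the real PVS letter groupings.

```python
# Illustrative subset of a nickname table (not the real PVS lookup table)
NICKNAMES = {'Bill': 'William', 'Chuck': 'Charles'}

def expand_nicknames(record):
    """Return the record, plus a copy with the formal name if first_name is a known nickname."""
    records = [record]
    formal_name = NICKNAMES.get(record['first_name'])
    if formal_name is not None:
        # Duplicate the entire record, changing only the first name
        records.append({**record, 'first_name': formal_name})
    return records

def name_cut(record):
    """Simplified cut label: first initial, dash, last initial (an assumption for illustration)."""
    return f"{record['first_name'][0]}-{record['last_name'][0]}"

for expanded in expand_nicknames({'first_name': 'Bill', 'last_name': 'Smith'}):
    print(expanded['first_name'], name_cut(expanded))
```

Under these assumptions, "Bill Smith" yields one record that lands in the B-S cut and one that lands in the W-S cut, which is the behavior the quote describes.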

In [5]:
# Nickname processing
# Have not yet found a nickname list in PVS docs,
# so we do a minimal version for now -- could use
# another list such as the one in pseudopeople
# These examples all come directly from examples in the descriptions of PVS
nickname_standardizations = {
    "Bill": "William",
    "Chuck": "Charles",
    "Charlie": "Charles",
    "Cathy": "Catherine",
    "Matt": "Matthew",
}
has_nickname = census_2030.first_name.isin(nickname_standardizations.keys())
print(f'{has_nickname.sum()} nicknames in the Census')

# Add extra rows for the normalized names
census_2030 = pd.concat([
    census_2030,
    census_2030[has_nickname].assign(first_name=lambda df: df.first_name.replace(nickname_standardizations))
], ignore_index=True)

# Note: The above will introduce duplicates on record_id, so we redefine
# record_id to be unique (without getting rid of the original, input file record ID)
def add_unique_record_id(df, dataset_name):
    df = df.reset_index().rename(columns={'index': 'record_id'})
    df['record_id'] = f'{dataset_name}_' + df.record_id.astype(str)
    return df

census_2030 = add_unique_record_id(
    census_2030.rename(columns={'record_id': 'record_id_raw_input_file'}),
    "census_2030_preprocessed",
)

4 nicknames in the Census


In [6]:
# This list of fake names comes from the NORC report, p. 100-101
# It is what was used in PVS as of 2011
with open('fake-names.txt') as f:
    fake_names = pd.Series([name.strip().upper() for name in f.readlines()])

assert (fake_names == fake_names.str.upper()).all()
fake_names

0        (CONFIDENTIAL)
1      (NO MIDDLE NAME)
2           A RELUCTANT
3                 ADULT
4               ADULT F
             ...       
766       YOUNGEST DAUG
767       YOUNGEST GIRL
768              YR BOY
769             YR GIRL
770              YR OLD
Length: 771, dtype: object

In [7]:
for col in ["first_name", "last_name"]:
    has_fake_name = census_2030[col].str.upper().isin(fake_names)
    print(f'Found {has_fake_name.sum()} fake names in {col}')
    census_2030[col] = np.where(
        has_fake_name,
        np.nan,
        census_2030[col]
    )

Found 60 fake names in first_name
Found 57 fake names in last_name


## Address parsing and standardizing

Wagner and Layne, p. 11:

> The PVS editing process also incorporates an address parser and standardizer,
written in the C language and called as a function within SAS (U.S. Census Bureau
Geography Division, 1995). It performs parsing of address strings into individual output
fields (see Figure 2), and standardizes the spelling of key components of the address
such as street type. The PVS also incorporates use of a commercial product to update
zip codes, and correct misspellings of address elements.

As noted in the data generation notebook, for now we haven't combined the address parts
into a single string that would need to be parsed.
We plan to add this in a future version of pseudopeople.

For now, we do some simple standardization in each of the address parts (street number,
street name, etc).

In [8]:
def standardize_address_part(column):
    return (
        column
            # Remove leading or trailing whitespace
            .str.strip()
            # Turn any strings of consecutive whitespace into a single space
            .str.split().str.join(' ')
            # Normalize case
            .str.upper()
            # Normalize the word street as described in the example quoted from
            # Wagner and Layne p. 9
            # In reality, there would be many rules like this
            .str.replace(r'\b(STREET|STR)\b', 'ST', regex=True)
            # Make sure missingness is represented consistently
            .replace('', np.nan)
    )

address_cols = ['street_number', 'street_name', 'unit_number', 'city', 'state', 'zipcode']
census_2030[address_cols] = census_2030[address_cols].apply(standardize_address_part)
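
As a quick check of the street-type rule, here is a minimal sketch using only the standard library `re` module on made-up street names. It also shows why such a pattern should be written as a raw string: in an ordinary Python string literal, `'\b'` is a backspace character rather than a regex word boundary. The helper name `normalize_street_type` is hypothetical.

```python
import re

# Raw string: \b reaches the regex engine as a word boundary
STREET_TYPE_PATTERN = re.compile(r'\b(STREET|STR)\b')

def normalize_street_type(street_name):
    """Replace the whole words STREET or STR with ST (hypothetical helper)."""
    return STREET_TYPE_PATTERN.sub('ST', street_name)

print(normalize_street_type('MESA VERDE STREET'))
print(normalize_street_type('N MAIN STR'))
# STR is only a prefix here, not a whole word, so nothing changes
print(normalize_street_type('STRAWBERRY LN'))

# In a non-raw literal, '\b' is the single backspace character (0x08)
print('\b' == '\x08')
```

The word boundaries are what keep a name like STRAWBERRY from being rewritten.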

## MAFMatch

**General information about MAFMatch**

Wagner and Layne, p. 11:

> The PVS provides an additional address enhancement by matching records in the
incoming file to Census Bureau’s Master Address File (MAF) in order to assign a unique
address identifier, the MAF Identifier (MAFID), and other Census geographical codes
(e.g., Census tract and block). The MAFID is used in the PVS for search purposes and as a
linkage key for administrative files. Then, addresses are matched to the Census Bureau’s
Topologically Integrated Geographic Encoding and Referencing Database (TIGER) to
obtain Census geographical codes.

My understanding is that this step is useful for two reasons:
- Potentially adds an alternate version/way of formatting the same address
  (the way it is in the MAF, instead of the way it is in the input file).
  However, instead of using this alternate to match directly, it is boiled down
  to a MAFID, so it is only useful when an input file address and a reference file
  address (different to one another) both match to the same MAF address.
  (Presumably, the whole MAFMatch process described here needs to run on the reference files?)
- Adds Census geographies which could be used in blocking (but are they?)

I don't understand the TIGER part of this.
In particular, p. 12 describes a probabilistic/fuzzy match to TIGER, but I thought TIGER
would contain exactly the same addresses as the MAF.
I also don't understand what geocodes would be present in TIGER that wouldn't also be
present in the MAF.
Maybe these things are more out-of-sync with each other than I understood.

**MAFMatch in this case study**

In the decennial Census, the sampling frame is a subset of the MAF (Brown et al., p. 15, footnote 4).
That is, in the CUF, the address field (which is not supplied by the respondent)
would be the MAF address the questionnaire/NRFU was sent to.

Therefore, I don't believe it makes sense to do MAFMatch for the CUF, because all the addresses
are already the same.
I haven't fully confirmed this, but p. 14 of Wagner and Layne says

> The 2010 Census Unedited File (CUF), had 350 million records and processed through every PVS
module, excluding MAF match and SSN verification

which suggests that (like SSN verification) MAFMatch is not applicable to the CUF.

Given this, we skip MAFMatch here.
Also, if we wanted to add it, we would need to generate something like a MAF from pseudopeople.

## Drop records with insufficient information

In 2011 (NORC, p. 25):

> The initial edit process, described in the Introduction: PVS Background section, removes from consideration
incoming records that have no name data. Therefore, no record that is processed in PVS has blank first and last
names.

In 2014 it appears to be the same/similar, because in Table 2 of Wagner and Layne there is a row "NO SEARCH: Blank Name" (p. 18).

In 2023 (Brown et al., p. 28):

> Records containing sufficient PII to be linkable with some confidence, for example those
containing name and age, are sent through the linkage process.<sup>15</sup>

> <sup>15</sup> Records with names on the PVS invalid name list (e.g., “Mickey Mouse,” “householder,” or “son”) are excluded
from PVS search.

I'd prefer to use the 2023 information, but it is too vague:
"for example" means this is just an approximation, and it doesn't specify
which parts of "name" are required.
The footnote also seems to contradict earlier reports, which said fake names were simply
blanked out; blanking seems preferable to excluding the whole record.
Since the fake name step comes before this one, a fake first **and** last name would lead
to exclusion here, which is perhaps what the footnote was intending?

Here, we follow the blank-name approach.

In [9]:
census_2030 = census_2030[
    census_2030.first_name.notnull() |
    census_2030.last_name.notnull()
]

# Create derived variables for use in linkage

Here we create variables to be used as matching variables and blocking keys,
when those matching variables/blocking keys are defined in a way that is not already
present in the data at this point.

The variables needed here depend on the modules and passes described below -- see those
sections for more citations.

In [10]:
# We want to compare mailing address with physical address
geobase_reference_file = geobase_reference_file.rename(columns=lambda c: c.replace('mailing_address_', ''))

In [11]:
# PVS uses DOB as separate fields for day, month, and year
def split_dob(df, date_format='%Y%m%d'):
    df = df.copy()
    # Have to be floats because we want to treat as numeric for assessing similarity
    # Note that as of now, none of our pseudopeople noise types would change the punctuation ("/") in the date, but
    # they can insert non-numeric characters here or otherwise create invalid dates, in which case we fail to parse the date
    # and treat it as missing.
    dob = pd.to_datetime(df.date_of_birth, format=date_format, errors='coerce')
    df['month_of_birth'] = dob.dt.month
    df['year_of_birth'] = dob.dt.year
    df['day_of_birth'] = dob.dt.day
    return df.drop(columns=['date_of_birth'])

census_2030 = split_dob(census_2030, date_format='%m/%d/%Y')
geobase_reference_file = split_dob(geobase_reference_file)
name_dob_reference_file = split_dob(name_dob_reference_file)
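
The "fail to parse and treat as missing" behavior can be illustrated without pandas. This is a minimal sketch using the standard library `datetime.strptime` instead of `pd.to_datetime(..., errors='coerce')`; the two parsers are not guaranteed to agree on every edge case, and the helper name `split_date` and the example values are made up.

```python
from datetime import datetime

def split_date(value, date_format):
    """Return (year, month, day), or all None if the value cannot be parsed."""
    try:
        parsed = datetime.strptime(value, date_format)
    except (TypeError, ValueError):
        # TypeError: value is missing (None); ValueError: not a valid date in this format
        return (None, None, None)
    return (parsed.year, parsed.month, parsed.day)

print(split_date('12/05/1922', '%m/%d/%Y'))   # parses normally
print(split_date('19221205', '%Y%m%d'))       # same date, reference file format
print(split_date('02/30/1990', '%m/%d/%Y'))   # February 30 does not exist
print(split_date('1z/05/1922', '%m/%d/%Y'))   # non-numeric character inserted by noise
```

The last two calls return all-missing components, which is how an invalid or noised date ends up treated as missing in the split fields.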

In [12]:
# I don't fully understand the purpose of blocking on the geokey,
# as opposed to just blocking on its constituent columns.
# Maybe it is a way of dealing with missingness in those constituent
# columns (e.g. so an address with no unit number can still be blocked on geokey)?
def add_geokey(df):
    df = df.copy()
    df['geokey'] = (
        df.street_number + ' ' +
        df.street_name + ' ' +
        df.unit_number.fillna('') + ' ' +
        df.city + ' ' +
        df.state.astype(str) + ' ' +
        df.zipcode
    )
    # Normalize the whitespace -- necessary if the unit number was null
    df['geokey'] = (
        df.geokey.str.split().str.join(' ')
    )
    return df

geobase_reference_file = add_geokey(geobase_reference_file)
census_2030 = add_geokey(census_2030)
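
To illustrate the missingness point raised in the comment, here is a simplified plain-Python analogue of the geokey construction. It is a sketch, not a mirror of the pandas version: the helper name `make_geokey` is hypothetical, the address values (including the state) are made up, and it assumes that a missing unit number is tolerated while any other missing part makes the whole key missing.

```python
def make_geokey(street_number, street_name, unit_number, city, state, zipcode):
    """Join address parts into one blocking key (simplified, hypothetical helper)."""
    required = [street_number, street_name, city, state, zipcode]
    if any(part is None for part in required):
        # Assumption: a missing required part makes the whole key missing
        return None
    parts = [street_number, street_name, unit_number or '', city, state, zipcode]
    # Joining and re-splitting collapses the double space left by an empty unit number
    return ' '.join(' '.join(parts).split())

print(make_geokey('3743', 'MESA VERDE ST', None, 'ANYTOWN', 'WA', '00000'))
print(make_geokey('3743', 'MESA VERDE ST', 'APT 2', 'ANYTOWN', 'WA', '00000'))
print(make_geokey(None, 'MESA VERDE ST', None, 'ANYTOWN', 'WA', '00000'))
```

Under this reading, an address without a unit number still gets a usable key, whereas blocking directly on the constituent columns would skip it because one blocking field is missing.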

In [13]:
# Layne, Wagner, and Rothhaas p. 26: the name matching variables are
# First 15 characters First Name, First 15 characters Middle Name, First 12 characters Last Name
# Additionally, there are blocking columns for all of 1-3 initial characters of First/Last.
# We don't have a full middle name in pseudopeople (nor would that be present in a real CUF)
# so we have to stick to the first initial for middle.
def add_truncated_name_cols(df):
    df = df.copy()
    df['first_name_15'] = df.first_name.str[:15]
    df['last_name_12'] = df.last_name.str[:12]

    if 'middle_name' in df.columns and 'middle_initial' not in df.columns:
        df['middle_initial'] = df.middle_name.str[:1]

    for num_chars in [1, 2, 3]:
        df[f'first_name_{num_chars}'] = df.first_name.str[:num_chars]
        df[f'last_name_{num_chars}'] = df.last_name.str[:num_chars]

    return df

census_2030 = add_truncated_name_cols(census_2030)
geobase_reference_file = add_truncated_name_cols(geobase_reference_file)
name_dob_reference_file = add_truncated_name_cols(name_dob_reference_file)

In [14]:
# Layne, Wagner, and Rothhaas p. 26: phonetics are used in blocking (not matching)
# - Soundex for Street Name
# - NYSIIS code for First Name
# - NYSIIS code for Last Name
# - Reverse Soundex for First Name
# - Reverse Soundex for Last Name

def add_name_phonetics(df):
    df = df.copy()

    for col in ['first_name', 'last_name']:
        df[f'{col}_nysiis'] = df[col].dropna().apply(jellyfish.nysiis)
        df[f'{col}_reverse_soundex'] = df[col].dropna().str[::-1].apply(jellyfish.soundex)

    return df

def add_address_phonetics(df):
    df = df.copy()
    df['street_name_soundex'] = df.street_name.dropna().apply(jellyfish.soundex)
    return df

census_2030 = add_name_phonetics(census_2030)
census_2030 = add_address_phonetics(census_2030)

geobase_reference_file = add_address_phonetics(geobase_reference_file)

name_dob_reference_file = add_name_phonetics(name_dob_reference_file)
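
"Reverse Soundex" here just means Soundex applied to the reversed string, so it is driven by how a name ends rather than how it begins. The sketch below is a simplified American Soundex written in plain Python for illustration; it is an assumption that it agrees with `jellyfish.soundex` on ordinary names, and it may differ in edge cases.

```python
# Consonant groups of American Soundex; vowels, H, W, and Y have no code
SOUNDEX_CODES = {
    **dict.fromkeys('BFPV', '1'),
    **dict.fromkeys('CGJKQSXZ', '2'),
    **dict.fromkeys('DT', '3'),
    'L': '4',
    **dict.fromkeys('MN', '5'),
    'R': '6',
}

def soundex(name):
    """Simplified American Soundex: first letter plus up to three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ''
    digits = []
    previous_code = SOUNDEX_CODES.get(letters[0], '')
    for letter in letters[1:]:
        if letter in 'HW':
            continue  # H and W do not separate letters that share a code
        code = SOUNDEX_CODES.get(letter, '')
        if code and code != previous_code:
            digits.append(code)
        previous_code = code  # a vowel resets this, so repeated codes count again
    return (letters[0] + ''.join(digits) + '000')[:4]

def reverse_soundex(name):
    """Soundex of the reversed name."""
    return soundex(name[::-1])

print(soundex('Gerald'), reverse_soundex('Gerald'))
```

For "Gerald" this gives a reverse Soundex of D462, the same value that appears in the `first_name_reverse_soundex` column of the prepared Census data.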

In [15]:
# Columns used to "cut the database": ZIP3 and a grouping of first and last initial
def add_zip3(df):
    return df.assign(zip3=lambda x: x.zipcode.str[:3])

def add_first_last_initial_categories(df):
    # Page 20 of the NORC report: "Name-cuts are defined by combinations of the first characters of the first and last names. The twenty letter groupings
    # for the first character are: A-or-blank, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, and U-Z."
    initial_cut = lambda x: x.fillna('A').str[0].replace('A', 'A-or-blank').replace(['U', 'V', 'W', 'X', 'Y', 'Z'], 'U-Z')
    return df.assign(first_initial_cut=lambda x: initial_cut(x.first_name), last_initial_cut=lambda x: initial_cut(x.last_name))

In [16]:
census_2030 = add_zip3(census_2030)
census_2030 = add_first_last_initial_categories(census_2030)

geobase_reference_file = add_zip3(geobase_reference_file)

name_dob_reference_file = add_first_last_initial_categories(name_dob_reference_file)

# Data, ready to link

In [17]:
census_2030

Unnamed: 0,record_id,record_id_raw_input_file,household_id,first_name,middle_initial,last_name,age,street_number,street_name,unit_number,...,first_name_3,last_name_3,first_name_nysiis,first_name_reverse_soundex,last_name_nysiis,last_name_reverse_soundex,street_name_soundex,zip3,first_initial_cut,last_initial_cut
0,census_2030_preprocessed_0,simulated_census_2030_0,0_8033,Gerald,R,Allen,86,1130,MALLORY LN,,...,Ger,All,GARALD,D462,ALAN,N400,M464,000,G,A-or-blank
1,census_2030_preprocessed_1,simulated_census_2030_1,0_1066,April,S,Hayden,33,32597,DELACORTE DR,,...,Apr,Hay,APRAL,L610,HAYDAN,N300,D426,000,A-or-blank,H
2,census_2030_preprocessed_2,simulated_census_2030_2,0_1066,Loretta,T,Lowe,71,32597,DELACORTE DR,,...,Lor,Low,LARAT,A364,LAO,E400,D426,000,L,L
3,census_2030_preprocessed_3,simulated_census_2030_3,0_2514,Sandra,A,Sorrentino,75,4458,WIBDSOR PL,,...,San,Sor,SANDR,A635,SARANTAN,O535,W132,000,S,S
4,census_2030_preprocessed_4,simulated_census_2030_4,0_5627,Bobby,S,Baker,44,,WINDING TRAIL RD,,...,Bob,Bak,BABY,Y110,BACAR,R210,W535,000,B,B
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11028,census_2030_preprocessed_11028,simulated_census_2030_11028,0_10693,Zechariah,C,Deshpande,0,1534,BENTLEY DR,,...,Zec,Des,ZACAR,H622,DASPAND,E351,B534,000,U-Z,D
11029,census_2030_preprocessed_11029,simulated_census_2030_3803,0_492,Charles,P,Foreman,64,224,N SANFORD ST,,...,Cha,For,CARL,S462,FARANAN,N561,N251,000,C,F
11030,census_2030_preprocessed_11030,simulated_census_2030_6806,0_11238,William,S,Vick,59,829,BERKELEY AVE,,...,Wil,Vic,WALAN,M400,VAC,K100,B624,000,U-Z,U-Z
11031,census_2030_preprocessed_11031,simulated_census_2030_7705,0_6764,Charles,P,Qusmus,63,6801,REDWOOD TERRACE,,...,Cha,Qus,CARL,S462,QASN,S522,R333,000,C,Q


In [18]:
geobase_reference_file

Unnamed: 0,record_id,ssn,first_name,middle_name,last_name,street_number,street_name,unit_number,po_box,city,...,last_name_12,middle_initial,first_name_1,last_name_1,first_name_2,last_name_2,first_name_3,last_name_3,street_name_soundex,zip3
0,simulated_geobase_reference_file_0,685-77-0916,Betty,Audrey,Keel,47461,W ROXBURY DR,,,ANYTOWN,...,Keel,A,B,K,Be,Ke,Bet,Kee,W621,000
1,simulated_geobase_reference_file_1,765-44-4521,Ethep,Nancy,Collier,,,,,,...,Collier,N,E,C,Et,Co,Eth,Col,,
2,simulated_geobase_reference_file_2,765-44-4521,Ethep,Nancy,Collier,,,,,,...,Collier,N,E,C,Et,Co,Eth,Col,,
3,simulated_geobase_reference_file_3,726-57-2168,Josephine,Margaret,Babbie,3743,MESA VERDE ST,,,ANYTOWN,...,Babbie,M,J,B,Jo,Ba,Jos,Bab,M216,000
4,simulated_geobase_reference_file_4,726-57-2168,Josephine,Margaret,Babbie,3743,MESA VERDE ST,,,,...,Babbie,M,J,B,Jo,Ba,Jos,Bab,M216,000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32820,simulated_geobase_reference_file_32820,939-53-1702,Josiah,L,Mcmiller,2904,YACHT CLUB PT,,,ANYTOWN,...,Mcmiller,L,J,M,Jo,Mc,Jos,Mcm,Y232,000
32821,simulated_geobase_reference_file_32821,939-53-1702,Josiah,L,Mcmiller,W14066,E VALENCIA RD,,,ANYTOWN,...,Mcmiller,L,J,M,Jo,Mc,Jos,Mcm,E145,000
32822,simulated_geobase_reference_file_32822,939-53-1702,Josiah,L,Mcmiller,W14066,E VALENCIA RD,,,ANYTOWN,...,Mcmiller,L,J,M,Jo,Mc,Jos,Mcm,E145,000
32823,simulated_geobase_reference_file_32823,993-56-2286,Larisa,A,Beilfuss,1411,W 80TH ST,,,ANYTOWN,...,Beilfuss,A,L,B,La,Be,Lar,Bei,W323,000


In [19]:
name_dob_reference_file

Unnamed: 0,record_id,ssn,first_name,middle_name,last_name,pik,month_of_birth,year_of_birth,day_of_birth,first_name_15,...,first_name_2,last_name_2,first_name_3,last_name_3,first_name_nysiis,first_name_reverse_soundex,last_name_nysiis,last_name_reverse_soundex,first_initial_cut,last_initial_cut
0,simulated_name_dob_reference_file_0,685-77-0916,Betty,Audrey,Keel,105906,12.0,1922.0,5.0,Betty,...,Be,Ke,Bet,Kee,BATY,Y310,CAL,L200,B,K
1,simulated_name_dob_reference_file_1,765-44-4521,Ethep,Nancy,Collier,104653,,,,Ethep,...,Et,Co,Eth,Col,ETAP,P300,CALAR,R420,E,C
2,simulated_name_dob_reference_file_2,765-44-4521,Ethep,Nancy,Collier,104653,2.0,1923.0,13.0,Ethep,...,Et,Co,Eth,Col,ETAP,P300,CALAR,R420,E,C
3,simulated_name_dob_reference_file_3,726-57-2168,Josephine,Margaret,Babbie,106223,7.0,1923.0,28.0,Josephine,...,Jo,Ba,Jos,Bab,JASAFAN,E512,BABY,E110,J,B
4,simulated_name_dob_reference_file_4,365-44-3027,Betty,Mary,,107896,8.0,1923.0,9.0,Betty,...,Be,,Bet,,BATY,Y310,,,B,A-or-blank
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19870,simulated_name_dob_reference_file_19870,977-84-6791,Liam,L,Wheatley,108816,,,,Liam,...,Li,Wh,Lia,Whe,LAN,M400,WATLY,Y430,L,U-Z
19871,simulated_name_dob_reference_file_19871,997-95-0405,Joseph,D,Dellinger,108817,,,,Joseph,...,Jo,De,Jos,Del,JASAF,H122,DALANGAR,R254,J,D
19872,simulated_name_dob_reference_file_19872,939-53-1702,Josiah,L,Mcmiller,108818,,,,Josiah,...,Jo,Mc,Jos,Mcm,JAS,H220,MCNALAR,R452,J,M
19873,simulated_name_dob_reference_file_19873,993-56-2286,Larisa,A,Beilfuss,108819,,,,Larisa,...,La,Be,Lar,Bei,LARAS,A264,BALF,S141,L,B


# Emulate Multi-Match with fastLink

Wagner and Layne, p. 8:

> The PVS employs its probabilistic record linkage software, Multi-Match (Wagner
2012), as an integral part of the PVS.

Wagner and Layne, p. 12:

> PVS uses the same Multi-Match engine for each probabilistic search type. For each
search module the analyst defines a parameter file, which is passed to Multi-Match. The
parameter file includes threshold value(s) for the number of passes, blocking keys, and
within each pass, the match variables, match comparison type, and matching weights...
>
> Records must first match exactly on the blocking keys before any comparisons
between the match variables are attempted. Each match variable is given an *m* and *u*
probability, which is translated by MultiMatch as agreement and disagreement weights.
The sum of all match variable comparison weights for a record pair is the composite
weight. All record pairs with a composite weight greater than or equal to the threshold
set in the parameter file are linked, and the records from the incoming file for these
linked cases are excluded from all remaining passes. All Numident records are always
available for linking in every pass. Any record missing data for any of the blocking fields
for a pass skips that pass and moves to the next pass.
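
The quoted text does not give Multi-Match's exact weight formula. As a minimal sketch, assuming the standard Fellegi-Sunter log-likelihood-ratio weights (base 2), the translation from *m* and *u* probabilities to a composite weight looks like this. All probabilities and the threshold below are made up for illustration, and the function names are hypothetical.

```python
import math

def agreement_weight(m, u):
    """Weight added when a match variable agrees: log2(m / u)."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Weight added (usually negative) when a match variable disagrees."""
    return math.log2((1 - m) / (1 - u))

def composite_weight(agreements, parameters):
    """Sum the per-variable weights for one record pair.

    agreements: dict of variable name -> True (agrees) or False (disagrees)
    parameters: dict of variable name -> (m, u) probabilities
    """
    total = 0.0
    for variable, agrees in agreements.items():
        m, u = parameters[variable]
        total += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return total

# Made-up m and u probabilities, for illustration only
parameters = {
    'first_name': (0.9, 0.05),
    'last_name': (0.95, 0.02),
    'year_of_birth': (0.98, 0.01),
}
agreements = {'first_name': True, 'last_name': True, 'year_of_birth': False}
threshold = 3.0  # hypothetical threshold from a parameter file

weight = composite_weight(agreements, parameters)
print(round(weight, 2), weight >= threshold)
```

A variable with high *m* and low *u* contributes a large positive weight when it agrees and a large negative weight when it disagrees, so in this toy pair the two name agreements outweigh the disagreement on year of birth.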

[fastLink](https://github.com/kosukeimai/fastLink) is similar to Multi-Match
in that both are based on the Fellegi-Sunter approach to record linkage.
However, it does not include blocking, and using the recommended approach of calling
it separately on each block
[does not perform well for very small and numerous blocks](https://github.com/kosukeimai/fastLink/issues/73#issuecomment-1672077417).

Given these differences, it isn't currently possible to emulate the PVS cascade with fastLink.
Instead, we do a single blocking pass.
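
For contrast, the cascade that Wagner and Layne describe can be sketched in plain Python. This is only an illustration of the control flow in the quote (exact blocking, a threshold per pass, linked input records dropping out of later passes, records with a missing blocking field skipping the pass). The function `run_cascade`, the toy scoring function, and all records are hypothetical.

```python
def run_cascade(input_records, reference_records, passes, score_pair):
    """Return {input record id: [linked reference ids]} from a multi-pass cascade."""
    links = {}
    for blocking_keys, threshold in passes:
        # All reference records are available in every pass
        blocks = {}
        for ref in reference_records:
            key = tuple(ref.get(k) for k in blocking_keys)
            if None not in key:
                blocks.setdefault(key, []).append(ref)
        for rec in input_records:
            if rec['id'] in links:
                continue  # linked in an earlier pass, so excluded from this one
            key = tuple(rec.get(k) for k in blocking_keys)
            if None in key:
                continue  # missing a blocking field: skip this pass
            matched = [ref['id'] for ref in blocks.get(key, [])
                       if score_pair(rec, ref) >= threshold]
            if matched:
                links[rec['id']] = matched
    return links

def score_pair(rec, ref):
    """Toy composite weight: the number of agreeing name fields."""
    return sum(rec[f] == ref[f] for f in ('first_name', 'last_name'))

input_records = [
    {'id': 'a', 'first_name': 'ANN', 'last_name': 'LEE', 'zip3': '981'},
    {'id': 'b', 'first_name': 'BOB', 'last_name': 'RAY', 'zip3': None},
    {'id': 'c', 'first_name': 'CY', 'last_name': 'ODD', 'zip3': '981'},
]
reference_records = [
    {'id': 'r1', 'first_name': 'ANN', 'last_name': 'LEE', 'zip3': '981'},
    {'id': 'r2', 'first_name': 'BOB', 'last_name': 'RAY', 'zip3': '982'},
]
# Each pass: (blocking keys, threshold); later passes block more loosely
passes = [(('zip3', 'last_name'), 2), (('last_name',), 1)]

links = run_cascade(input_records, reference_records, passes, score_pair)
print(links)
```

Record `a` links in the first pass, record `b` skips the first pass because its ZIP3 is missing and links in the second, and record `c` never finds a candidate.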

## Estimate parameters

In Multi-Match, parameters are not directly estimated from the data.
They are primarily set manually by analysts, with a different set of parameter files
maintained for each type of input file (e.g. survey, administrative).

We estimate the parameters using the GeoBase Reference File,
since it has all columns that are used for matching with any reference file.

In [20]:
import sys, pathlib
import os
# Use R in the current conda environment
os.environ["R_HOME"] = str(pathlib.Path(sys.executable).parent.parent / 'lib/R')

In [21]:
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri

pandas2ri.activate()

fastLink = importr('fastLink')

In [22]:
%%time

# From the fastLink README:
# ## Run the algorithm on the random samples
# rs.out <- fastLink(
#   dfA = dfA.s, dfB = dfB.s, 
#   varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
#   stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
#   partial.match = c("firstname", "lastname", "streetname"),
#   estimate.only = TRUE
# )

from rpy2 import robjects as ro

COMPARISON_COLUMNS = ["first_name", "middle_initial", "last_name", "day_of_birth", "month_of_birth", "year_of_birth", "geokey"]

prep_for_fastLink = lambda df, *args: df[COMPARISON_COLUMNS].astype(str).fillna(ro.NA_Character).reset_index().rename(columns={'index': 'python_index'})

em_object = fastLink.fastLink(
    dfA = prep_for_fastLink(geobase_reference_file),
    dfB = prep_for_fastLink(census_2030),
    varnames = ro.StrVector(COMPARISON_COLUMNS),
    stringdist_match = ro.StrVector(["first_name", "last_name", "geokey"]),
    partial_match = ro.StrVector(["first_name", "last_name", "geokey"]),
    # Just run EM, don't link
    estimate_only = True,
)


fastLink(): Fast Probabilistic Record Linkage

If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic.
Calculating matches for each variable.
Getting counts for parameter estimation.
    Parallelizing calculation using OpenMP. 3 threads out of 2 are used.
Running the EM algorithm.
CPU times: user 2min 32s, sys: 8.91 s, total: 2min 41s
Wall time: 2min 19s


## Implement matching passes

In [23]:
from dataclasses import dataclass

# Calculate this once to save time -- mapping from record_id to record_id_raw_input_file
# There can be multiple records (with different record_id) for the same input file record
# (record_id_raw_input_file) because of the handling of nicknames by creating extra records.
record_id_raw_input_file_by_record_id = census_2030.set_index('record_id').record_id_raw_input_file

all_piks = pd.concat([
    name_dob_reference_file[["record_id", "pik"]],
    geobase_reference_file[["record_id", "pik"]]
], ignore_index=True).set_index("record_id").pik

dates_of_death = (
    pd.read_parquet(f'{input_dir}/{data_to_use}/simulated_census_numident.parquet')
        .set_index('pik')
        .date_of_death
        .pipe(lambda s: pd.to_datetime(s, format='%Y%m%d', errors='coerce'))
)

class PersonLinkageCascade:
    def __init__(self):
        # This dataframe will accumulate the PIKs to attach to the input file
        self.confirmed_piks = pd.DataFrame(columns=["record_id_raw_input_file", "pik"])
        self.current_module = None

    def start_module(self, *args, **kwargs):
        assert self.current_module is None or self.current_module.confirmed
        self.current_module = PersonLinkageModule(*args, **kwargs)

    def run_matching_pass(self, *args, **kwargs):
        self.current_module.run_matching_pass(*args, already_confirmed_piks=self.confirmed_piks, **kwargs)

    def confirm_piks(self, *args, **kwargs):
        # Make sure we are not about to confirm PIKs for any input file records
        # that have already been PIKed
        assert (
            set(self.current_module.provisional_links.record_id_raw_input_file) &
            set(self.confirmed_piks.record_id_raw_input_file)
        ) == set()

        newly_confirmed_piks = self.current_module.confirm_piks_from_provisional_links()

        self.confirmed_piks = pd.concat([
            self.confirmed_piks,
            newly_confirmed_piks,
        ], ignore_index=True)

        return self.confirmed_piks

@dataclass
class PersonLinkageModule:
    name: str
    reference_file: pd.DataFrame
    reference_file_name: str
    # cut_columns: list[str]
    matching_columns: list[str]

    def __post_init__(self):
        self.provisional_links = pd.DataFrame(columns=["record_id_census_2030"])
        self.confirmed = False

    def run_matching_pass(
        self,
        pass_name,
        probability_threshold=0.99,
        input_data_transformation=lambda x: x,
        already_confirmed_piks=pd.DataFrame(columns=["record_id_raw_input_file"]),
    ):
        assert not self.confirmed

        print(f"Running {pass_name} of {self.name}")

        census_to_match = (
            census_2030[
                # Only look for matches among records that have not received a confirmed PIK
                ~census_2030.record_id_raw_input_file.isin(already_confirmed_piks.record_id_raw_input_file) &
                # Only look for matches among records that have not received a provisional link
                # NOTE: "records" here does not mean input file records -- a nickname record having
                # a provisional link does not prevent a canonical name record from the same input record
                # from continuing to match
                ~census_2030.record_id.isin(self.provisional_links.record_id_census_2030)
            ].pipe(input_data_transformation)
        )
        # census_2030_for_fastLink = prep_for_fastLink(census_to_match)
        # reference_file_for_fastLink = prep_for_fastLink(self.reference_file, self.reference_file_name)
        print(f"Files to link are {len(census_to_match):,.0f} and {len(self.reference_file):,.0f} records")

        # fastLink is too slow to run with real blocking, as noted above, so in practice
        # every record goes into a single dummy block.
        # The loop below is still a homegrown blocking implementation, written because
        # fastLink's own blocking (blockData) doesn't support blocking on multiple columns at once(!).
        # Technically we could have worked around that by concatenating the columns into a single key,
        # but fastLink still requires you to make a separate linking call for each block anyway
        census_to_match['dummy_blocking_key'] = 1
        self.reference_file['dummy_blocking_key'] = 1
        # I experimented with blocking on the cut columns only as a compromise, but it was still too slow
        blocking_cols = ['dummy_blocking_key'] # self.cut_columns
        census_2030_blocks = census_to_match.groupby(blocking_cols, as_index=False)
        reference_file_blocks = self.reference_file.groupby(blocking_cols, as_index=False)

        print(f'{census_2030_blocks.ngroups} blocks')

        varnames = ro.StrVector(COMPARISON_COLUMNS)
        stringdist_match = ro.StrVector(["first_name", "last_name", "geokey"])
        partial_match = ro.StrVector(["first_name", "last_name", "geokey"])

        new_provisional_links = []
        for index, (key, census_2030_block) in enumerate(census_2030_blocks):
            try:
                # Weirdly, there is an inconsistency between the keys you get when iterating and the ones
                # you have to pass to get_group
                if isinstance(key, tuple) and len(key) == 1:
                    key = key[0]
                reference_file_block = reference_file_blocks.get_group(key)
            except KeyError:
                # Nothing in the reference file for this block; so that implies there are no
                # matches to find
                continue

            if len(reference_file_block) == 1:
                # HACK -- fastLink seems to not work at all if dfB is only one row
                reference_file_block = pd.concat([reference_file_block, pd.DataFrame(np.nan, index=[-1], columns=reference_file_block.columns)])

            census_2030_block_for_fastLink = prep_for_fastLink(census_2030_block)
            reference_file_block_for_fastLink = prep_for_fastLink(reference_file_block)
            with (ro.default_converter + pandas2ri.converter).context():
                conversion = ro.conversion.get_conversion()
                census_2030_block_for_fastLink_r = conversion.py2rpy(census_2030_block_for_fastLink)
                reference_file_block_for_fastLink_r = conversion.py2rpy(reference_file_block_for_fastLink)

            fastLink_result = fastLink.fastLink(
                dfA=census_2030_block_for_fastLink_r,
                dfB=reference_file_block_for_fastLink_r,
                varnames=varnames,
                stringdist_match=stringdist_match,
                partial_match=partial_match,
                em_obj=em_object,
                threshold_match=probability_threshold,
            )

            census_2030_matches_r_indices = fastLink_result.rx2('matches').rx2('inds.a')
            reference_file_matches_r_indices = fastLink_result.rx2('matches').rx2('inds.b')

            census_2030_matches = pd.Index(census_2030_block_for_fastLink_r.rx(census_2030_matches_r_indices, 'python_index'))
            reference_file_matches = pd.Index(reference_file_block_for_fastLink_r.rx(reference_file_matches_r_indices, 'python_index'))

            new_provisional_links.append(
                census_2030_block.loc[census_2030_matches].reset_index(drop=True).add_suffix('_census_2030')
                .join(
                    reference_file_block.loc[reference_file_matches].reset_index(drop=True).add_suffix('_reference_file')
                )
                .join(
                    pd.Series(fastLink_result.rx2('posterior'), name='match_probability')
                )
            )

        if len(new_provisional_links) > 0:
            new_provisional_links = pd.concat(new_provisional_links, ignore_index=True)

        if len(new_provisional_links) > 0:
            new_provisional_links["record_id_raw_input_file"] = (
                new_provisional_links.record_id_census_2030.map(record_id_raw_input_file_by_record_id)
            )

            self.provisional_links = pd.concat([
                self.provisional_links,
                new_provisional_links.assign(module_name=self.name, pass_name=pass_name)
            ], ignore_index=True)

        still_eligible = (
            (~census_2030.record_id_raw_input_file.isin(already_confirmed_piks.record_id_raw_input_file)) &
            (~census_2030.record_id.isin(self.provisional_links.record_id_census_2030))
        )
        print(f'Matched {len(new_provisional_links)} records; {still_eligible.mean():.2%} still eligible to match')

    def confirm_piks_from_provisional_links(self):
        assert not self.confirmed

        provisional_links = self.provisional_links
        provisional_links["pik"] = provisional_links.record_id_reference_file.map(all_piks)

        # "After the initial set of links is created in GeoSearch, a post-search program is run to determine
        # which of the links are retained. A series of checks are performed: First the date of death
        # information from the Numident is checked and links to deceased persons are dropped. Next a
        # check is made for more than one SSN assigned to a source record. If more than one SSN is
        # assigned, the best link is selected based on match weights. If no best SSN is determined, all SSNs
        # assigned in the GeoSearch module are dropped and the input record cascades to the next
        # module. A similar post-search program is run at the end of all search modules."
        # - Layne et al. p. 5

        # Drop links to deceased people
        # NOTE: Brown et al. (2023), p. 38, discusses at length the number of PVS matches to deceased
        # people, which should not be possible given this check.
        # Even though that report is more recent, I can't think of a reason why this check would have
        # been *removed* from PVS. Perhaps it reflects something experimental they were doing for
        # the AR Census in the 2023 report?
        link_dates_of_death = provisional_links["pik"].map(dates_of_death)
        # Census day 2030
        deceased_links = link_dates_of_death <= pd.to_datetime("2030-04-01")
        print(f'{deceased_links.sum()} input records linked to deceased people, dropping links')
        provisional_links = provisional_links[~deceased_links]

        # Check for multiple linkage to a single input file record
        max_probability = provisional_links.groupby("record_id_raw_input_file").match_probability.max()
        piks_per_input_file = (
            provisional_links.groupby("record_id_raw_input_file")
                .apply(lambda df: df[df.match_probability == max_probability[df.name]].pik.nunique())
        )

        multiple_piks = piks_per_input_file[piks_per_input_file > 1].index
        print(f'{len(multiple_piks)} input records linked to multiple PIKs, dropping links')
        provisional_links = (
            provisional_links[~provisional_links.record_id_raw_input_file.isin(multiple_piks)]
                .sort_values("match_probability")
                .groupby("record_id_raw_input_file")
                .last()
                .reset_index()
        )

        assert (provisional_links.groupby("record_id_raw_input_file").pik.nunique() == 1).all()

        self.confirmed = True
        self.provisional_links = None
        
        return (
            provisional_links[[
                "record_id_raw_input_file",
                "record_id_census_2030",
                "record_id_reference_file",
                "pik",
                "module_name",
                "pass_name",
                "match_probability",
            ]]
        )

In [24]:
person_linkage_cascade = PersonLinkageCascade()

# Single module

Because fastLink offers no performant way to block, we run only a single module.
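
For reference, the homegrown multi-column blocking that `run_matching_pass` is structured around can be sketched with plain pandas. This is only a toy illustration: the frames, the choice of `zipcode` and `year_of_birth` as blocking columns, and the pair-counting step (which stands in for the per-block fastLink call) are all assumptions made here, not what the notebook actually runs.

```python
import pandas as pd

# Invented toy data; in the notebook these would be the census and reference files.
toy_input = pd.DataFrame({
    "zipcode": ["00000", "00000", "11111"],
    "year_of_birth": ["1990", "1985", "1990"],
    "first_name": ["Ann", "Bob", "Cy"],
})
toy_reference = pd.DataFrame({
    "zipcode": ["00000", "00000", "22222"],
    "year_of_birth": ["1990", "1990", "1990"],
    "first_name": ["Ann", "Anne", "Dee"],
})

# Hypothetical blocking columns, chosen only for this example.
blocking_cols = ["zipcode", "year_of_birth"]

# Index the reference blocks by key so each input block can look up its counterpart.
reference_blocks = {key: block for key, block in toy_reference.groupby(blocking_cols)}

candidate_pair_counts = {}
for key, input_block in toy_input.groupby(blocking_cols):
    reference_block = reference_blocks.get(key)
    if reference_block is None:
        # Nothing in the reference file for this block, so no matches to find
        continue
    # Stand-in for the linking call: count the pairs that would be compared
    candidate_pair_counts[key] = len(input_block) * len(reference_block)

print(candidate_pair_counts)
```

Only pairs that agree exactly on every blocking column are ever compared, which is where the speedup would come from. The cost is one linking call per block, and that per-call overhead is the part that made blocking impractical with fastLink here.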

In [25]:
person_linkage_cascade.start_module(
    name="single_module",
    reference_file=geobase_reference_file,
    reference_file_name="geobase_reference_file",
    matching_columns=[
        "first_name_15",
        "last_name_12",
        "middle_initial",
        "day_of_birth",
        "month_of_birth",
        "year_of_birth",
        "street_number",
        "street_name",
        "unit_number",
        "zipcode",
    ],
)

## Pass 1: regular

In [26]:
person_linkage_cascade.run_matching_pass(
    pass_name="single_module_regular",
)

Running single_module_regular of single_module
Files to link are 11,030 and 32,825 records
1 blocks

fastLink(): Fast Probabilistic Record Linkage

If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic.
Calculating matches for each variable.
Getting counts for parameter estimation.
    Parallelizing calculation using OpenMP. 3 threads out of 2 are used.
Imputing matching probabilities using provided EM object.
Getting the indices of estimated matches.
    Parallelizing calculation using OpenMP. 3 threads out of 2 are used.
Deduping the estimated matches.
Getting the match patterns for each estimated match.
Matched 10243 records; 7.14% still eligible to match


## Pass 2: switched names

In [27]:
def switch_first_and_last_names(df):
    return (
        df.rename(columns={"first_name": "last_name", "last_name": "first_name"})
            # Re-calculate the truncated versions of first and last.
            # NOTE: It is not necessary to re-calculate the phonetic versions, because
            # those are never used in any pass that has a name switch.
            .pipe(add_truncated_name_cols)
    )
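
To see what this transformation does, here is a small sketch on an invented one-row frame. `toy_add_truncated_name_cols` is a hypothetical stand-in for the notebook's `add_truncated_name_cols`. It assumes the truncated columns are simply the first 15 and 12 characters of the names, which is inferred from the column names `first_name_15` and `last_name_12` rather than confirmed.

```python
import pandas as pd

def toy_add_truncated_name_cols(df):
    # Hypothetical stand-in: assumes truncation to the first 15 / 12 characters
    return df.assign(
        first_name_15=df.first_name.str[:15],
        last_name_12=df.last_name.str[:12],
    )

# A toy record whose first and last names were entered in the wrong fields
toy = pd.DataFrame(
    {"first_name": ["Sorrentino"], "last_name": ["Sandra"]}
).pipe(toy_add_truncated_name_cols)

switched = (
    # rename applies both label changes at once, so this swaps the two columns
    toy.rename(columns={"first_name": "last_name", "last_name": "first_name"})
    # The truncated columns still hold the pre-swap values, so recompute them
    .pipe(toy_add_truncated_name_cols)
)
print(switched[["first_name", "last_name", "first_name_15", "last_name_12"]])
```

Without the second `pipe`, the truncated columns would keep the values computed before the swap, so a pass matching on them would behave exactly like the regular pass.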

In [28]:
person_linkage_cascade.run_matching_pass(
    pass_name="single_module_switched_names",
    input_data_transformation=switch_first_and_last_names,
)

Running single_module_switched_names of single_module
Files to link are 787 and 32,825 records
1 blocks

fastLink(): Fast Probabilistic Record Linkage

If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic.
Calculating matches for each variable.
Getting counts for parameter estimation.
    Parallelizing calculation using OpenMP. 3 threads out of 2 are used.
Imputing matching probabilities using provided EM object.
Getting the indices of estimated matches.
    Parallelizing calculation using OpenMP. 3 threads out of 2 are used.
Deduping the estimated matches.
Getting the match patterns for each estimated match.
Matched 9 records; 7.05% still eligible to match


## Post-process and confirm PIKs
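
As a rough illustration of the two post-search checks quoted from Layne et al. in `confirm_piks_from_provisional_links`, the sketch below applies them to invented toy links. It uses a slightly different formulation from the method above (`transform("max")` instead of `groupby(...).apply`), and every value in `toy_links` and `toy_dates_of_death` is made up.

```python
import pandas as pd

toy_links = pd.DataFrame({
    "record_id_raw_input_file": ["in_0", "in_1", "in_1", "in_2", "in_2", "in_3"],
    "pik": [10, 11, 12, 13, 14, 15],
    "match_probability": [0.999, 0.999, 0.995, 0.999, 0.999, 0.999],
})
# Only PIK 15 has a recorded date of death in this toy example
toy_dates_of_death = pd.Series(pd.to_datetime(["2029-12-31"]), index=[15])

# Check 1: drop links to people who died on or before Census day 2030.
# PIKs without a date of death map to NaT, which compares as False.
link_dates_of_death = toy_links["pik"].map(toy_dates_of_death)
toy_links = toy_links[~(link_dates_of_death <= pd.Timestamp("2030-04-01"))]

# Check 2: keep the best link per input record; if several PIKs tie for best, drop them all.
max_probability = (
    toy_links.groupby("record_id_raw_input_file").match_probability.transform("max")
)
best_links = toy_links[toy_links.match_probability == max_probability]
piks_at_best = best_links.groupby("record_id_raw_input_file").pik.nunique()
ambiguous = piks_at_best[piks_at_best > 1].index

confirmed = (
    best_links[~best_links.record_id_raw_input_file.isin(ambiguous)]
    .drop_duplicates("record_id_raw_input_file")
)
print(confirmed)
```

In this toy data, `in_3` loses its link because PIK 15 is deceased, `in_2` is dropped because two PIKs tie for the best probability, and `in_1` keeps PIK 11 because it has a strictly higher probability than PIK 12.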

In [29]:
person_linkage_cascade.confirm_piks()

103 input records linked to deceased people, dropping links
0 input records linked to multiple PIKs, dropping links


Unnamed: 0,record_id_raw_input_file,pik,record_id_census_2030,record_id_reference_file,module_name,pass_name,match_probability
0,simulated_census_2030_0,89484,census_2030_preprocessed_0,simulated_geobase_reference_file_951,single_module,single_module_regular,1.000000
1,simulated_census_2030_1,98736,census_2030_preprocessed_1,simulated_geobase_reference_file_17348,single_module,single_module_regular,1.000000
2,simulated_census_2030_10,94481,census_2030_preprocessed_10,simulated_geobase_reference_file_9789,single_module,single_module_regular,0.999999
3,simulated_census_2030_100,100835,census_2030_preprocessed_100,simulated_geobase_reference_file_21248,single_module,single_module_regular,1.000000
4,simulated_census_2030_1000,93179,census_2030_preprocessed_1000,simulated_geobase_reference_file_7496,single_module,single_module_regular,0.999999
...,...,...,...,...,...,...,...
10143,simulated_census_2030_9994,100273,census_2030_preprocessed_9994,simulated_geobase_reference_file_20169,single_module,single_module_regular,0.999995
10144,simulated_census_2030_9996,95280,census_2030_preprocessed_9996,simulated_geobase_reference_file_11208,single_module,single_module_regular,1.000000
10145,simulated_census_2030_9997,101556,census_2030_preprocessed_9997,simulated_geobase_reference_file_22666,single_module,single_module_regular,1.000000
10146,simulated_census_2030_9998,98532,census_2030_preprocessed_9998,simulated_geobase_reference_file_17011,single_module,single_module_regular,1.000000


In [30]:
person_linkage_cascade.confirmed_piks.groupby(["module_name", "pass_name"]).size().sort_values(ascending=False)

module_name    pass_name                   
single_module  single_module_regular           10140
               single_module_switched_names        8
dtype: int64

In [31]:
person_linkage_cascade.confirmed_piks

Unnamed: 0,record_id_raw_input_file,pik,record_id_census_2030,record_id_reference_file,module_name,pass_name,match_probability
0,simulated_census_2030_0,89484,census_2030_preprocessed_0,simulated_geobase_reference_file_951,single_module,single_module_regular,1.000000
1,simulated_census_2030_1,98736,census_2030_preprocessed_1,simulated_geobase_reference_file_17348,single_module,single_module_regular,1.000000
2,simulated_census_2030_10,94481,census_2030_preprocessed_10,simulated_geobase_reference_file_9789,single_module,single_module_regular,0.999999
3,simulated_census_2030_100,100835,census_2030_preprocessed_100,simulated_geobase_reference_file_21248,single_module,single_module_regular,1.000000
4,simulated_census_2030_1000,93179,census_2030_preprocessed_1000,simulated_geobase_reference_file_7496,single_module,single_module_regular,0.999999
...,...,...,...,...,...,...,...
10143,simulated_census_2030_9994,100273,census_2030_preprocessed_9994,simulated_geobase_reference_file_20169,single_module,single_module_regular,0.999995
10144,simulated_census_2030_9996,95280,census_2030_preprocessed_9996,simulated_geobase_reference_file_11208,single_module,single_module_regular,1.000000
10145,simulated_census_2030_9997,101556,census_2030_preprocessed_9997,simulated_geobase_reference_file_22666,single_module,single_module_regular,1.000000
10146,simulated_census_2030_9998,98532,census_2030_preprocessed_9998,simulated_geobase_reference_file_17011,single_module,single_module_regular,1.000000


# Resulting PIKs
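
The cells below attach the confirmed PIKs to the raw input file with a left merge and then report the share of records that were PIKed. As a small sketch of why `validate="1:1"` is a useful guard at this step, the toy example below (with invented frames and values) shows the merge succeeding on unique keys and raising when a record appears twice in the PIK table.

```python
import pandas as pd

toy_input = pd.DataFrame({"record_id": ["in_0", "in_1", "in_2", "in_3"]})
toy_piks = pd.DataFrame({"record_id": ["in_0", "in_1", "in_3"], "pik": [10, 11, 15]})

# Left merge keeps every input record; records without a PIK get a missing value
toy_piked = toy_input.merge(toy_piks, how="left", on="record_id", validate="1:1")
piked_proportion = toy_piked.pik.notnull().mean()
print(f"{piked_proportion:.2%} of the toy input records were PIKed")

# If an input record had two rows in the PIK table, validate="1:1" raises
# instead of silently duplicating that record in the output.
duplicated_piks = pd.concat([toy_piks, toy_piks.iloc[[0]]], ignore_index=True)
try:
    toy_input.merge(duplicated_piks, how="left", on="record_id", validate="1:1")
    raised = False
except pd.errors.MergeError:
    raised = True
print(f"Duplicate key raised MergeError: {raised}")
```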

In [32]:
pik_values = (
    person_linkage_cascade.confirmed_piks
        .rename(columns={"record_id_raw_input_file": "record_id"})[["record_id", "pik"]]
        .drop_duplicates()
)

In [33]:
census_2030_piked = census_2030_raw_input.copy()
census_2030_piked = census_2030_piked.merge(
    pik_values,
    how="left",
    on="record_id",
    validate="1:1",
)
census_2030_piked

Unnamed: 0,record_id,household_id,first_name,middle_initial,last_name,age,date_of_birth,street_number,street_name,unit_number,city,state,zipcode,housing_type,relationship_to_reference_person,sex,race_ethnicity,year,pik
0,simulated_census_2030_0,0_8033,Gerald,R,Allen,86,11/03/1943,1130,mallory ln,,Anytown,WA,00000,Household,Reference person,Male,Black,2030,89484
1,simulated_census_2030_1,0_1066,April,S,Hayden,33,10/23/1996,32597,delacorte dr,,Anytown,WA,00000,Household,Other nonrelative,Female,Black,2030,98736
2,simulated_census_2030_2,0_1066,Loretta,T,Lowe,71,06/01/1958,32597,delacorte dr,,Anytown,WA,00000,Household,Reference person,Female,White,2030,91258
3,simulated_census_2030_3,0_2514,Sandra,A,Sorrentino,75,03/18/1954,4458,wibdsor pl,,Anytown,WA,00000,Household,Reference person,Female,Multiracial or Other,2030,90622
4,simulated_census_2030_4,0_5627,Bobby,S,Baker,44,05/20/1985,,winding trail rd,,Anytown,WA,00000,Household,Other nonrelative,Male,White,2030,96379
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11024,simulated_census_2030_11024,0_10778,Jeremy,T,Boyd,46,07/01/1983,211,quiet wsy,,Anytown,WA,00000,Household,Reference person,Male,Black,2030,95963
11025,simulated_census_2030_11025,0_11001,Wendy,M,Gross,54,12/05/1975,2801,blje rdv dr n,,Anytown,WA,00000,Household,Reference person,Female,White,2030,94444
11026,simulated_census_2030_11026,0_5308,Ember,H,Samuels,10,10/26/2019,24113,lauder,,Anytown,WA,00000,Household,Reference person,Female,Black,2030,104164
11027,simulated_census_2030_11027,0_10693,Athena,V,Deshpande,27,07/05/2002,1534,bentley dr,,Anytown,WA,00000,Household,Reference person,Female,Asian,2030,106182


In [34]:
piked_proportion = census_2030_piked.pik.notnull().mean()
# Compare with 90.28% of input records PIKed in the 2010 CUF,
# as reported in Wagner and Layne, Table 2, p. 18 
print(f'{piked_proportion:.2%} of the input records were PIKed')

92.01% of the input records were PIKed


In [35]:
census_2030_piked.to_parquet(f'{output_dir}/{data_to_use}/census_2030_piked.parquet')

In [36]:
person_linkage_cascade.confirmed_piks.to_parquet(f'{output_dir}/{data_to_use}/confirmed_piks.parquet')