# Running Splycer

Here I demonstrate how to run Splycer from start to finish. First, import the necessary modules.

In [1]:
import sys
import os
import pickle as pkl
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

sys.path.append("R:/JoePriceResearch/record_linking/projects/deep_learning/ml-record-linking/")
from record_set import RecordDataFrame
from pairs_set import PairsCOO
from feature_engineer import FeatureEngineer
from xgboost_match import XGBoostMatch
os.chdir("R:/JoePriceResearch/record_linking/projects/deep_learning/ml-record-linking/example")

The next cell creates RecordDataFrame objects, which are essentially wrappers around dataframes. The wrapper standardizes how Splycer retrieves records, so you can use any data structure you want as long as you write a wrapper for it.

In [2]:
#Load in the census data.
delaware_1910 = RecordDataFrame(2, pd.read_csv("delaware_1910.csv", index_col="index"))
delaware_1920 = RecordDataFrame(3, pd.read_csv("delaware_1920.csv", index_col="index"))
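
To illustrate the wrapper idea, here is a minimal sketch of a record set backed by a plain dictionary instead of a dataframe. The class name `DictRecordSet`, the `get_record` method, and the meaning given to the first constructor argument are all assumptions made for illustration; they are not Splycer's actual interface.

```python
class DictRecordSet:
    """Hypothetical record-set wrapper backed by a dict (not Splycer's API)."""

    def __init__(self, record_id, records):
        # Assumed: an identifier for the data set, like the 2 and 3 used above.
        self.record_id = record_id
        # Maps a record index to a dict of column values.
        self.records = records

    def get_record(self, index, columns=None):
        """Return the requested columns of one record as a dict."""
        record = self.records[index]
        if columns is None:
            return dict(record)
        return {col: record[col] for col in columns}


toy_1910 = DictRecordSet(2, {10: {"first": "john", "last": "smith", "birth_year": 1880}})
print(toy_1910.get_record(10, ["first", "birth_year"]))
```

Any code that only calls the lookup method would work the same way whether the records live in a dataframe, a dictionary, or a database.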

In [3]:
#Replace missing first and last names with empty strings.
for records in (delaware_1910, delaware_1920):
    for col in ("first", "last"):
        records.df[col] = records.df[col].fillna("")

In [4]:
delaware_1910.df["first"].isnull().value_counts()

False    203320
Name: first, dtype: int64

Next, we create the pairs set, which represents the post-blocking comparisons that we want to make. Once again, you can write a wrapper for any data structure you want. Here, I am using a COO (coordinate format) matrix, a sparse representation that stores only the pairs we want to compare.

In [5]:
#Load in the compares set
compares = pd.read_csv("delaware_compares.csv")
compares = PairsCOO(2, 3, compares.index_1910.values, compares.index_1920.values, np.ones(compares.shape[0]))
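
To show what the coordinate format stores, here is a small sketch in plain Python. It assumes the three arrays passed to `PairsCOO` play the roles of row indices, column indices, and values; the toy indices and the `to_dense` helper are hypothetical and only for illustration.

```python
# Coordinate (COO) format: three parallel sequences, one entry per stored pair.
rows = [0, 0, 2]        # index of the record in the first set
cols = [1, 3, 2]        # index of the record in the second set
data = [1.0, 1.0, 1.0]  # value stored for the pair (1 marks "compare these")


def to_dense(rows, cols, data, shape):
    """Expand COO triples into a dense list-of-lists matrix."""
    dense = [[0.0] * shape[1] for _ in range(shape[0])]
    for r, c, v in zip(rows, cols, data):
        dense[r][c] = v
    return dense


dense = to_dense(rows, cols, data, shape=(3, 4))
for line in dense:
    print(line)
```

Only three entries are stored for a matrix with twelve cells, which is why the format suits blocked comparisons, where most possible pairs are never compared.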

Constructing the feature engineer is the most involved part of the pipeline. You build it incrementally by adding each comparison that you want to perform. For each comparison, you must specify the column name(s) of the record you want to compare, the type of comparison, and any extra arguments that the comparison takes.

Example: suppose I wanted to compare first names using the Jaro-Winkler score. After creating a FeatureEngineer object `fe`, the code to do this is `fe.add_comparison("first", "jw")`. If I also wanted to use a commonality weight, I would use `fe.add_comparison("first", "jw", {"comm_weight": "d", "comm_col": "first_comm"})`.
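
For readers unfamiliar with the Jaro-Winkler score, here is a minimal sketch of the standard definition in plain Python. It is an illustration under stated assumptions (prefix scale 0.1, prefix capped at 4 characters), not the implementation behind Splycer's `"jw"` comparison, and it does not apply any commonality weight.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The prefix boost is what makes the score attractive for names, where transcription errors are more common toward the end of the string.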

Below, you can see that I generate the comparison pipeline by creating parallel lists of columns, comparisons, and extra arguments. This is the most concise way of coding up the comparison pipeline.

In [6]:
#Construct the feature engineer
fe = FeatureEngineer()
#FIXME add parent geo-coordinates
cols = ["marstat", "race", "rel", "mbp", "fbp", "first_sdxn", "last_sdxn", "bp", "county",
        "immigration", "birth_year", ["res_lat", "res_lon"], ["bp_lat", "bp_lon"],
        [f"first_vec{i}" for i in range(2, 202)], [f"last_vec{i}" for i in range(2,202)],
        "first", "last",
        "first", "last"
       ]

col_comps = ["exact match"] * 9
col_comps.extend(["abs dist"] * 2)
col_comps.extend(["euclidean dist"] * 4)
#col_comps.extend(["geo dist"] * 2)
col_comps.extend(["jw"] * 2)
col_comps.extend(["trigram"] * 2)

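#Extra arguments, aligned with cols: none for the first 5 columns
col_args = [{} for _ in range(5)]
#Commonality weights for first_sdxn, last_sdxn, and bp
col_args.extend([{"comm_weight": "d", "comm_col": "first_comm"}, {"comm_weight": "d", "comm_col": "last_comm"},
                 {"comm_weight": "d", "comm_col": "bp_comm"}])
#No extra arguments for the next 7 comparisons
col_args.extend([{} for _ in range(7)])
#Commonality weights for the jw and trigram comparisons on first and last
col_args.extend([{"comm_weight": "d", "comm_col": "first_comm"}, {"comm_weight": "d", "comm_col": "last_comm"}] * 2)
#All three lists must line up one-to-one
assert len(cols) == len(col_comps) == len(col_args)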

In [7]:
for i,j,k in zip(cols, col_comps, col_args):
    fe.add_comparison(i, j, k)
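
Parallel lists are compact, but they are easy to misalign. One alternative is to keep each column, comparison type, and argument dictionary together in a single tuple. The sketch below shows the idea on a handful of the comparisons above; `StubEngineer` is a hypothetical stand-in that only records calls, used here so the example runs without Splycer installed.

```python
# Each tuple keeps a column, its comparison type, and its extra arguments together.
spec = [
    ("marstat", "exact match", {}),
    ("first_sdxn", "exact match", {"comm_weight": "d", "comm_col": "first_comm"}),
    ("birth_year", "abs dist", {}),
    (["res_lat", "res_lon"], "euclidean dist", {}),
    ("first", "jw", {"comm_weight": "d", "comm_col": "first_comm"}),
]


class StubEngineer:
    """Hypothetical stand-in that records calls; not Splycer's FeatureEngineer."""

    def __init__(self):
        self.pipeline = []

    def add_comparison(self, col, comp_type, args=None):
        self.pipeline.append((col, comp_type, args or {}))


engine = StubEngineer()
for col, comp_type, args in spec:
    engine.add_comparison(col, comp_type, args)

print(len(engine.pipeline))
```

With this layout, adding or removing a comparison touches one line, and no length assertion is needed.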

In [8]:
#Load the pickled XGBoost model
with open("R:/JoePriceResearch/record_linking/projects/deep_learning/ml-record-linking/model.xgboost", "rb") as file:
    model = pkl.load(file)
#Name the model's features f0 through f18, one per comparison in the pipeline
model.get_booster().feature_names = [f"f{i}" for i in range(19)]

In [12]:
test = XGBoostMatch(delaware_1910, delaware_1920, compares, fe, model)
test.run("test.csv")

The pipeline currently runs 3,237 compares/sec. Choosing a different n-gram algorithm would give the largest speedup.
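
For reference, one common way to score n-gram overlap is the Jaccard similarity of character trigram sets. The sketch below is a minimal version that assumes lowercase text padded with two leading spaces and one trailing space; it is not necessarily what Splycer's trigram comparison computes, and the function names are hypothetical.

```python
def trigrams(text):
    """Return the set of character trigrams of a padded, lowercased string."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, in [0, 1]."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(trigram_similarity("smith", "smyth"))
```

Because the trigram set of each name can be computed once and reused across every pair that name appears in, a set-based approach like this is one place to look when tuning throughput.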

In [10]:
fe.pipeline

[<comparisons.BooleanMatch at 0x2b4a83b3630>,
 <comparisons.BooleanMatch at 0x2b4a83b3a20>,
 <comparisons.BooleanMatch at 0x2b4a83b3a90>,
 <comparisons.BooleanMatch at 0x2b4a83b3ac8>,
 <comparisons.BooleanMatch at 0x2b48418ff60>,
 <comparisons.BooleanMatch at 0x2b48418fdd8>,
 <comparisons.BooleanMatch at 0x2b48418fe80>,
 <comparisons.BooleanMatch at 0x2b48418f320>,
 <comparisons.BooleanMatch at 0x2b48418f4e0>,
 <comparisons.AbsDistance at 0x2b48418f748>,
 <comparisons.AbsDistance at 0x2b48418f668>,
 <comparisons.EuclideanDistance at 0x2b48418f550>,
 <comparisons.EuclideanDistance at 0x2b48418fb70>,
 <comparisons.EuclideanDistance at 0x2b4a8281048>,
 <comparisons.EuclideanDistance at 0x2b4a8281160>,
 <comparisons.JW at 0x2b4a82812e8>,
 <comparisons.JW at 0x2b4a82812b0>,
 <comparisons.TriGram at 0x2b4a8281518>,
 <comparisons.TriGram at 0x2b4a82815f8>]
