# Running Splycer

Here I demonstrate the process of running Splycer from start to finish. First, load all of the necessary modules. In the future, `sys.path.append` will not be necessary because Splycer will be an installed package in your Python environment. Also make sure your working directory is the example folder; the `os.chdir()` call below sets it.

For readability, I turned warnings off in this notebook. Don't do this in your own code: warnings are often valuable, especially those about invalid values during comparisons.
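If you do need to silence one noisy block, a narrower option than a global filter is `warnings.catch_warnings()`, which restores the previous filters when the block exits. This is a minimal sketch using only the standard library; `noisy_compare` is a hypothetical stand-in for a comparison that warns about invalid values, not a Splycer function.

```python
import warnings


def noisy_compare(a, b):
    """Hypothetical comparison that warns when a value is missing."""
    if a is None or b is None:
        warnings.warn("invalid value encountered in comparison", RuntimeWarning)
        return False
    return a == b


# Suppress warnings only inside this block; the old filters return on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    quiet_result = noisy_compare(None, 1910)

# With the filter set to "always", the same call emits a warning,
# which is recorded in `caught` here so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud_result = noisy_compare(None, 1910)
```

After this runs, `caught` holds the single `RuntimeWarning` from the second call, while the first call raised nothing visible.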

In [1]:
import sys
import os
import pickle as pkl
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

sys.path.append("R:/JoePriceResearch/record_linking/projects/deep_learning/ml-record-linking/build/lib.win-amd64-3.7")
from record_set import RecordDataFrame
from pairs_set import PairsCOO
from feature_engineer import FeatureEngineer
from xgboost_match import XGBoostMatch
os.chdir("R:/JoePriceResearch/record_linking/projects/deep_learning/ml-record-linking/example")

import warnings
warnings.simplefilter("ignore")

## Record Set Objects

This next cell creates record set objects. A record set object is a container for record information that a linker algorithm can access with a unique identifier. In this tutorial, the record set object's internal structure is a [Pandas DataFrame](https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673). One innovation of our code is that it does not matter what your data's internal structure is as long as you use/create a [wrapper](https://en.wikipedia.org/wiki/Wrapper_function) for it. This gives you the freedom to use the data format that is best for your linking project. I plan to document how to write your own wrapper; in the meantime, I have created wrappers for several common data structures to choose from.
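To make the wrapper idea concrete, here is a minimal sketch of a record set backed by a plain dict instead of a DataFrame. The class name `DictRecordSet` and the method name `get_record` are assumptions for illustration only; they are not taken from Splycer's actual interface.

```python
class DictRecordSet:
    """Hypothetical record set wrapper around a plain dict of records.

    The method name `get_record` is an assumption for illustration; it is
    not taken from Splycer's actual interface.
    """

    def __init__(self, records):
        # `records` maps a unique identifier to a dict of column -> value.
        self._records = dict(records)

    def get_record(self, uid):
        """Return the record stored under the unique identifier `uid`."""
        return self._records[uid]

    def __len__(self):
        return len(self._records)


census_1910 = DictRecordSet({
    8361303: {"birth_year": 1874, "bp": 65},
    8437960: {"birth_year": 1874, "bp": 55},
})
record = census_1910.get_record(8437960)
```

A linker that only ever asks for records by identifier would not need to know whether the data lives in a dict, a DataFrame, or a database.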

(Note: the feather file format I am reading here is a binary format that Python, R, Julia, and some other languages can read. It's extremely fast, so I use it here to avoid long wait times when loading the data. However, I wouldn't recommend this format for your projects since many other programs can't read it and it has some limiting design choices.)

In [2]:
delaware_1910 = RecordDataFrame(2, pd.read_feather("delaware_1910.feather", nthreads=4).set_index("index", drop=True))
delaware_1920 = RecordDataFrame(3, pd.read_feather("delaware_1920.feather", nthreads=4).set_index("index", drop=True))

Here's what the data look like. You can see there are some missing values. If I were actually running predictions, I would fix these missing values first, and I should fix them in these files in the future.

In [3]:
delaware_1910.df.head()

| index | marstat | birth_year | household | immigration | race | rel | female | bp | mbp | fbp | ... | last_vec192 | last_vec193 | last_vec194 | last_vec195 | last_vec196 | last_vec197 | last_vec198 | last_vec199 | last_vec200 | last_vec201 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8361303 | 1.0 | 1874.0 | 1934610.0 | NaN | 0 | 1 | 1 | 65 | 46 | 36 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8437960 | 0.0 | 1874.0 | 1955405.0 | NaN | 0 | -1 | 1 | 55 | 12 | 55 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8387550 | 0.0 | 1905.0 | 1975755.0 | NaN | 0 | 3 | 1 | 18 | 18 | 18 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8529805 | 0.0 | 1880.0 | 1954502.0 | 1880.0 | 0 | -1 | 1 | 10 | 10 | 10 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8529824 | 0.0 | 1878.0 | 1954502.0 | NaN | 0 | -1 | 1 | 55 | 55 | 55 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


## Pairs Set Objects

Next, we create the pairs set. This object contains the pairs of record ids for which we want to predict a match, where each id is the unique identifier used in the corresponding record set object. Once again, the internal data structure doesn't matter as long as you use/create a wrapper for your specific data structure. Here, I am using a [coordinate sparse matrix](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_(COO)).
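As a rough illustration of the coordinate (COO) layout, the sketch below stores pairs as three parallel lists: the id from the first record set, the id from the second, and a value for the pair. `CooPairs` is a hypothetical toy class rather than the `PairsCOO` implementation, and the 1920 ids are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CooPairs:
    """Hypothetical COO-style container: three parallel lists, one entry per pair."""
    rows: List[int] = field(default_factory=list)    # ids from the first record set
    cols: List[int] = field(default_factory=list)    # ids from the second record set
    data: List[float] = field(default_factory=list)  # value stored for each pair

    def add(self, id_a, id_b, value=1.0):
        self.rows.append(id_a)
        self.cols.append(id_b)
        self.data.append(value)

    def __iter__(self):
        # Yield (id_a, id_b, value) triples in insertion order.
        return iter(zip(self.rows, self.cols, self.data))


pairs = CooPairs()
pairs.add(8361303, 9000001)  # the second id is a made-up 1920 identifier
pairs.add(8361303, 9000002)
pairs.add(8437960, 9000003)
triples = list(pairs)
```

Only the pairs you actually want to compare are stored, which matters when the full cross product of two censuses would be far too large to hold in memory.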

In [4]:
compares = pd.read_csv("delaware_compares.csv")
compares = PairsCOO(2,3, compares.index_1910.values, compares.index_1920.values, np.ones(compares.shape[0]))

## Comparing Records with Feature Engineer

Constructing the feature engineer is the most involved part of the pipeline. You build it by incrementally adding functions that measure record similarity. For each comparison, you must specify the column name(s) of the record you want to compare, the type of comparison, and any extra arguments that the comparison takes. In keeping with the theme, you can use any similarity score you want as long as you create a wrapper for it.
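One way to picture this design is as a registry of named similarity functions plus an ordered list of comparisons. The sketch below is a simplified, hypothetical stand-in: `add_comparison` and `compares_avail` echo names used in this notebook, but `MiniFeatureEngineer`, `register`, `pipeline`, and `compare` are assumptions, and the real `FeatureEngineer` may work differently.

```python
class MiniFeatureEngineer:
    """Hypothetical, simplified stand-in for a feature engineer."""

    def __init__(self):
        # Registry of available similarity functions, keyed by name.
        self.compares_avail = {
            "exact match": lambda a, b: float(a == b),
            "abs dist": lambda a, b: float(abs(a - b)),
        }
        self.pipeline = []  # (column, function, extra args) in insertion order

    def register(self, name, func):
        """Wrap any callable(a, b, **kwargs) -> float as a named comparison."""
        self.compares_avail[name] = func

    def add_comparison(self, col, comp_type, extra_args=None):
        self.pipeline.append((col, self.compares_avail[comp_type], extra_args or {}))

    def compare(self, rec1, rec2):
        """Return one similarity score per comparison, in the order added."""
        return [func(rec1[col], rec2[col], **args) for col, func, args in self.pipeline]


def same_first_letter(a, b):
    """Custom similarity score: 1.0 if both strings start with the same letter."""
    return float(bool(a) and bool(b) and a[0] == b[0])


mini = MiniFeatureEngineer()
mini.register("first letter", same_first_letter)
mini.add_comparison("race", "exact match")
mini.add_comparison("birth_year", "abs dist")
mini.add_comparison("first", "first letter")

scores = mini.compare(
    {"race": 0, "birth_year": 1874, "first": "john"},
    {"race": 0, "birth_year": 1876, "first": "jon"},
)
```

Each record pair becomes one feature vector, with one entry per comparison, which is the shape a downstream classifier expects.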

#### Example

Suppose I wanted to compare first names using the Jaro-Winkler score. After creating a `FeatureEngineer` object `comparison_engine`, the code to do this is `comparison_engine.add_comparison("first_name", "jw")`. If I also wanted to use a commonality weight, I would use `comparison_engine.add_comparison("first_name", "jw", {"comm_weight": 'd', "comm_col": "first_comm"})`.

Below, you can see that I generate the comparison pipeline by creating lists of columns, comparisons, and extra arguments. This is the most efficient way of coding up the comparison pipeline.

In [5]:
#Construct the feature engineer
fe = FeatureEngineer()
cols = ["marstat", "race", "rel", "mbp", "fbp", "first_sdxn", "last_sdxn", "bp", "county",
        "immigration", "birth_year", ["res_lat", "res_lon"], ["bp_lat", "bp_lon"],
        [f"first_vec{i}" for i in range(2, 202)], [f"last_vec{i}" for i in range(2,202)],
        "first", "last",
        "first", "last"
       ]

#similarity functions
col_comps = ["exact match"] * 9
col_comps.extend(["abs dist"] * 2)
col_comps.extend(["euclidean dist"] * 4)
#col_comps.extend(["geo dist"] * 2)
col_comps.extend(["jw"] * 2)
col_comps.extend(["trigram"] * 2)

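#extra arguments (one dict per entry in cols, in the same order)
col_args = [{} for _ in range(5)]  # marstat, race, rel, mbp, fbp
col_args.extend([{"comm_weight": "d", "comm_col": "first_comm"},  # first_sdxn
                 {"comm_weight": "d", "comm_col": "last_comm"},   # last_sdxn
                 {"comm_weight": "d", "comm_col": "bp_comm"}])    # bp
col_args.extend([{} for _ in range(7)])  # county through the last_vec columns
col_args.extend([{"comm_weight": "d", "comm_col": "first_comm"},
                 {"comm_weight": "d", "comm_col": "last_comm"}] * 2)  # jw, then trigram, on first/last
#all three lists must line up one-to-one
assert len(cols) == len(col_comps) == len(col_args)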

I then loop over the lists I created and add each comparison to the feature engineer.

In [6]:
for col, comp, args in zip(cols, col_comps, col_args):
    fe.add_comparison(col, comp, args)

Here's a look at all of the available similarity functions I've coded up so far:

In [7]:
list(fe.compares_avail.keys())

['jw',
 'abs dist',
 'euclidean dist',
 'geo dist',
 'bigram',
 'trigram',
 'ngram',
 'exact match']
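To give a feel for what an n-gram score measures, here is a pure-Python sketch. It assumes the n-gram scores are Jaccard similarities over sets of character n-grams, which is what the `NGram` cell at the end of this notebook computes. The function names are mine, and returning 0 for two strings with no n-grams is an assumption.

```python
def char_ngrams(text, n):
    """Return the set of character n-grams in `text` (empty if len(text) < n)."""
    count = max(0, len(text) - n + 1)
    return {text[i:i + n] for i in range(count)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity of the n-gram sets of `a` and `b`."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # assumption: two strings with no n-grams score 0
    return len(grams_a & grams_b) / len(union)


trigram_score = ngram_similarity("martin", "martinez")  # 4 shared of 6 distinct trigrams
bigram_score = ngram_similarity("test", "tsting", n=2)  # 1 shared of 7 distinct bigrams
```

Because the score depends only on shared substrings, it tolerates insertions and truncations that would make an exact match fail.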

## Linking Model

Here I load a model. For now, this is an old model that will give you junk predictions for our current data since I didn't bother cleaning the Delaware data. Hopefully this will be improved so that you can see the actual predictions that the model is making.

In [8]:
#Load the pickled XGBoost model
with open("R:/JoePriceResearch/record_linking/projects/deep_learning/ml-record-linking/model.xgboost", "rb") as file:
    model = pkl.load(file)
#Name the features f0..f18, one per comparison added to the feature engineer
model.get_booster().feature_names = [f"f{i}" for i in range(19)]

This final step is to create and run a linker object. A linker object takes all of the previous parts and uses the similarity scores generated to either predict whether a record pair is a match or return a probability of a record pair being a match. Once again, the code is agnostic to which linking algorithm you use as long as the linking algorithm knows how to work with each of the objects created above.

For this notebook, I use [XGBoost](https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/), a powerful gradient-boosted decision tree library. The last step is to call `run()` with the name of the file where you want your predictions saved.

In [9]:
xgb = XGBoostMatch(delaware_1910, delaware_1920, compares, fe, model)
xgb.run("test.csv")

Predicting links...: : 700000it [01:11, 11586.79it/s]                                                                  


Linking completed


Currently, we can run 22 million predictions per hour on a single machine, not including blocking and reading in data. About 2/3 of that time is spent on the ngram comparison, so dropping it roughly triples throughput, to about 66 million predictions per hour. You can test with your own data how much dropping ngram similarity affects your precision and recall, but the effect probably won't be large since there are many other name comparison metrics that you can use. (If you really need it, you can try optimizing my ngram code, which just calls a canned Python module, and then submit a pull request to my GitHub repo.)
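The arithmetic behind that estimate: if a step takes a fraction f of the runtime, removing it leaves 1 - f of the time, so throughput scales by 1 / (1 - f). A small sketch, assuming the remaining steps take exactly as long as before (the function name is mine):

```python
from fractions import Fraction


def throughput_after_removal(current_rate, time_fraction_removed):
    """Rate after dropping a step that takes `time_fraction_removed` of the runtime.

    Assumes the remaining steps take exactly as long as before.
    """
    remaining = 1 - Fraction(time_fraction_removed)
    return current_rate / remaining


# 22 million predictions per hour, with 2/3 of the time spent on ngrams.
new_rate = throughput_after_removal(22_000_000, Fraction(2, 3))
```

Here `new_rate` works out to 66 million predictions per hour, three times the current rate.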

## Conclusion

And that's the pipeline! The only step not shown is the blocking algorithm. We used SQL to block extremely quickly, but you can do it however you want. Maybe in the future we'll have a dedicated class for blocking, but that is not a priority. Use SQL if you have a very large dataset, or use Stata/Pandas for smaller datasets (merges in Stata are better than Pandas).
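For readers who have not seen blocking before, here is a toy sketch of SQL blocking using Python's built-in `sqlite3` module. The table names, the rows, and the blocking rule (same birthplace, birth years within one year of each other) are all assumptions for illustration; this is not the blocking query we actually ran.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE census_1910 (id INTEGER PRIMARY KEY, birth_year INTEGER, bp INTEGER);
    CREATE TABLE census_1920 (id INTEGER PRIMARY KEY, birth_year INTEGER, bp INTEGER);
""")
# Toy rows; the ids and values are made up for illustration.
conn.executemany("INSERT INTO census_1910 VALUES (?, ?, ?)",
                 [(1, 1874, 65), (2, 1880, 10), (3, 1905, 18)])
conn.executemany("INSERT INTO census_1920 VALUES (?, ?, ?)",
                 [(11, 1875, 65), (12, 1880, 55), (13, 1904, 18), (14, 1874, 65)])

# Blocking rule (an assumption): same birthplace, birth years within 1 of each other.
candidate_pairs = conn.execute("""
    SELECT a.id, b.id
    FROM census_1910 AS a
    JOIN census_1920 AS b
      ON a.bp = b.bp
     AND ABS(a.birth_year - b.birth_year) <= 1
    ORDER BY a.id, b.id
""").fetchall()
conn.close()
```

Of the 12 possible pairs in this toy data, only 3 survive the block, and those are the pairs that would be handed to the pairs set for scoring.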

In [22]:
%%cython
import numpy as np
cimport numpy as np

# Typed as plain `set`: the subscripted form `set[tuple]` appears to be what
# crashed the Cython compiler (in AnalyseExpressionsTransform).
cpdef set create_ngram_cython(str sequence, int n):
    # Return the set of n-grams (as tuples of characters) in `sequence`.
    cdef int count = max(0, len(sequence) - n + 1)
    return {tuple(sequence[i:i+n]) for i in range(count)}

cpdef double jaccard_sim_cython(set x, set y):
    # Jaccard similarity: size of the intersection over size of the union.
    # Returning 0.0 when both sets are empty is an assumption that avoids
    # dividing by zero for strings shorter than n.
    cdef Py_ssize_t union_size = len(x.union(y))
    if union_size == 0:
        return 0.0
    return <double>len(x.intersection(y)) / union_size

class NGram():
    def __init__(self, col="first_name", comm_weight=None, comm_col=None, n=3):
        self.col = col
        self.n = n
        # Vectorize both methods so they also accept arrays of values.
        self.create_ngram = np.vectorize(self.create_ngram)
        self.jaccard_similarity = np.vectorize(self.jaccard_similarity)

    def jaccard_similarity(self, x, y):
        # Pure-Python equivalent:
        #   intersection_cardinality = len(x.intersection(y))
        #   union_cardinality = len(x.union(y))
        #   return intersection_cardinality / float(union_cardinality)
        return jaccard_sim_cython(x, y)

    def create_ngram(self, sequence): #, pad_left=False, pad_right=False, pad_symbol=None):
        # Optional padding (currently disabled):
        #   if pad_left:
        #       sequence = chain((pad_symbol,) * (self.n-1), sequence)
        #   if pad_right:
        #       sequence = chain(sequence, (pad_symbol,) * (self.n-1))
        # Pure-Python equivalent:
        #   sequence = list(sequence)
        #   count = max(0, len(sequence) - self.n + 1)
        #   return {tuple(sequence[i:i+self.n]) for i in range(count)}
        return create_ngram_cython(sequence, self.n)

    def compare(self, rec1, rec2):
        return self.jaccard_similarity(self.create_ngram(rec1[self.col]), self.create_ngram(rec2[self.col]))

cpdef main():
    # Quick check: bigram similarity between 'test' and 'tsting'.
    test = create_ngram_cython('test', 2)
    test2 = create_ngram_cython('tsting', 2)
    comp = jaccard_sim_cython(test, test2)
    print(comp)
main()