# Source of Code:
This code reproduces the results of the paper “Statistical supervised meta-ensemble algorithm for medical record linkage”. Most of the code comes from the paper’s GitHub repository, with minor modifications: the authors' code was amended to run the experiment 10 times and to record the mean and standard deviation of the 10 results.

K. Vo, J. Jonnagaddala and S.-T. Liaw, "Medical-Record-Linkage-Ensemble," 16 February 2019. [Online]. Available: https://github.com/ePBRN/Medical-Record-Linkage-Ensemble/.

# Source of Dataset:
The FEBRL datasets used in this experiment were sourced directly from the authors' GitHub repository, https://github.com/ePBRN/Medical-Record-Linkage-Ensemble/. Specifically, febrl3_UNSW_provided_by_authors.csv is the febrl3_UNSW.csv file from that repository, and febrl4_UNSW_provided_by_authors.csv is the febrl4_UNSW.csv file.

The febrl3_UNSW_provided_by_authors.csv and febrl4_UNSW_provided_by_authors.csv files were used instead of regenerating the FEBRL datasets with the authors' code because that code relies on the Python Record Linkage Toolkit library, so the generated dataset depends on the library version installed at the time. A dataset regenerated with the current version differs slightly from the FEBRL datasets published on the authors' GitHub repository. Jitendra Jonnagaddala, one of the paper's authors, stated when consulted that changes in the library are a reasonable explanation for this difference. The paper was published in 2019, and the most recent change to the library (https://github.com/J535D165/recordlinkage) was committed on April 19, 2022.

It is expected that different datasets will lead to different results. To eliminate this source of variation when reproducing the study, the febrl3_UNSW_provided_by_authors.csv and febrl4_UNSW_provided_by_authors.csv datasets were used. These datasets were published on the authors' GitHub repository and are likely to be the closest to the datasets used in the original study.
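Because a regenerated dataset depends on the installed library version, it can help to record that version alongside the results. The sketch below is one minimal way to do so with the standard library; the helper name `installed_version` is an illustrative assumption and is not part of the authors' code.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Record the version next to the experiment output for later comparison.
print("recordlinkage version:", installed_version("recordlinkage"))
```

Storing this value with each run makes it easier to tell whether a difference in a regenerated dataset coincides with a library upgrade.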

# 1.0 Importing Libraries

In [1]:
%%time
'''
Source: 
K. Vo, J. Jonnagaddala and S.-T. Liaw, "Medical-Record-Linkage-Ensemble," 16 February 2019. [Online]. 
Available: https://github.com/ePBRN/Medical-Record-Linkage-Ensemble/.
'''
import recordlinkage as rl, pandas as pd, numpy as np
from sklearn.model_selection import KFold
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.utils import shuffle
from recordlinkage.preprocessing import phonetic
from numpy.random import choice
import collections, numpy
from IPython.display import clear_output
from sklearn.model_selection import train_test_split, KFold
from math import comb
import statistics
from scipy import stats
import math

CPU times: user 1.25 s, sys: 169 ms, total: 1.42 s
Wall time: 1.66 s


# 2.0 Helper Functions

In [2]:
%%time
'''
Source: 
K. Vo, J. Jonnagaddala and S.-T. Liaw, "Medical-Record-Linkage-Ensemble," 16 February 2019. [Online]. 
Available: https://github.com/ePBRN/Medical-Record-Linkage-Ensemble/.
'''
def generate_true_links(df): 
    # although the match_id column is included in the original df to imply the true links,
    # this function will create the true_link object identical to the true_links properties
    # of recordlinkage toolkit, in order to exploit "Compare.compute()" from that toolkit
    # in extract_features() for extracting features quicker.
    # This process should be deprecated in the future release of the UNSW toolkit.
    df["rec_id"] = df.index.values.tolist()
    indices_1 = []
    indices_2 = []
    processed = 0
    for match_id in df["match_id"].unique():
        if match_id != -1:    
            processed = processed + 1
            # print("In routine generate_true_links(), count =", processed)
            # clear_output(wait=True)
            linkages = df.loc[df['match_id'] == match_id]
            for j in range(len(linkages)-1):
                for k in range(j+1, len(linkages)):
                    indices_1 = indices_1 + [linkages.iloc[j]["rec_id"]]
                    indices_2 = indices_2 + [linkages.iloc[k]["rec_id"]]    
    links = pd.MultiIndex.from_arrays([indices_1,indices_2])
    return links

def generate_false_links(df, size):
    # A counterpart of generate_true_links(), with the purpose to generate random false pairs
    # for training. The number of false pairs is specified as "size".
    df["rec_id"] = df.index.values.tolist()
    indices_1 = []
    indices_2 = []
    unique_match_id = df["match_id"].unique()
    for j in range(size):
        false_pair_ids = choice(unique_match_id, 2)
        candidate_1_cluster = df.loc[df['match_id'] == false_pair_ids[0]]
        candidate_1 = candidate_1_cluster.iloc[choice(range(len(candidate_1_cluster)))]
        candidate_2_cluster = df.loc[df['match_id'] == false_pair_ids[1]]
        candidate_2 = candidate_2_cluster.iloc[choice(range(len(candidate_2_cluster)))]
        indices_1 = indices_1 + [candidate_1["rec_id"]]
        indices_2 = indices_2 + [candidate_2["rec_id"]]
    links = pd.MultiIndex.from_arrays([indices_1,indices_2])
    return links

def swap_fields_flag(f11, f12, f21, f22):
    return int((f11 == f22) and (f12 == f21))

def extract_features(df, links):
    c = rl.Compare()
    c.string('given_name', 'given_name', method='jarowinkler', label='y_name')
    c.string('given_name_soundex', 'given_name_soundex', method='jarowinkler', label='y_name_soundex')
    c.string('given_name_nysiis', 'given_name_nysiis', method='jarowinkler', label='y_name_nysiis')
    c.string('surname', 'surname', method='jarowinkler', label='y_surname')
    c.string('surname_soundex', 'surname_soundex', method='jarowinkler', label='y_surname_soundex')
    c.string('surname_nysiis', 'surname_nysiis', method='jarowinkler', label='y_surname_nysiis')
    c.exact('street_number', 'street_number', label='y_street_number')
    c.string('address_1', 'address_1', method='levenshtein', threshold=0.7, label='y_address1')
    c.string('address_2', 'address_2', method='levenshtein', threshold=0.7, label='y_address2')
    c.exact('postcode', 'postcode', label='y_postcode')
    c.exact('day', 'day', label='y_day')
    c.exact('month', 'month', label='y_month')
    c.exact('year', 'year', label='y_year')
        
    # Build features
    feature_vectors = c.compute(links, df, df)
    return feature_vectors

def generate_train_X_y(df):
    # This routine is to generate the feature vector X and the corresponding labels y
    # with exactly equal number of samples for both classes to train the classifier.
    pos = extract_features(df, train_true_links)
    train_false_links = generate_false_links(df, len(train_true_links))    
    neg = extract_features(df, train_false_links)
    X = pos.values.tolist() + neg.values.tolist()
    y = [1]*len(pos)+[0]*len(neg)
    X, y = shuffle(X, y, random_state=0)
    X = np.array(X)
    y = np.array(y)
    return X, y

def train_model(modeltype, modelparam, train_vectors, train_labels, modeltype_2):
    if modeltype == 'svm': # Support Vector Machine
        model = svm.SVC(C = modelparam, kernel = modeltype_2)
        model.fit(train_vectors, train_labels) 
    elif modeltype == 'lg': # Logistic Regression
        model = LogisticRegression(C=modelparam, penalty = modeltype_2,class_weight=None, dual=False, fit_intercept=True, 
                                   intercept_scaling=1, max_iter=5000, multi_class='ovr', 
                                   n_jobs=1, random_state=None)
        model.fit(train_vectors, train_labels)
    elif modeltype == 'nb': # Naive Bayes
        model = GaussianNB()
        model.fit(train_vectors, train_labels)
    elif modeltype == 'nn': # Neural Network
        model = MLPClassifier(solver='lbfgs', alpha=modelparam, hidden_layer_sizes=(256, ), 
                              activation = modeltype_2,random_state=None, batch_size='auto', 
                              learning_rate='constant',  learning_rate_init=0.001, 
                              power_t=0.5, max_iter=10000, shuffle=True, 
                              tol=0.0001, verbose=False, warm_start=False, momentum=0.9, 
                              nesterovs_momentum=True, early_stopping=False, 
                              validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
        model.fit(train_vectors, train_labels)
    return model

def classify(model, test_vectors):
    result = model.predict(test_vectors)
    return result

    
def evaluation(test_labels, result):
    true_pos = np.logical_and(test_labels, result)
    count_true_pos = np.sum(true_pos)
    true_neg = np.logical_and(np.logical_not(test_labels),np.logical_not(result))
    count_true_neg = np.sum(true_neg)
    false_pos = np.logical_and(np.logical_not(test_labels), result)
    count_false_pos = np.sum(false_pos)
    false_neg = np.logical_and(test_labels,np.logical_not(result))
    count_false_neg = np.sum(false_neg)
    precision = count_true_pos/(count_true_pos+count_false_pos)
    sensitivity = count_true_pos/(count_true_pos+count_false_neg) # sensitivity = recall
    confusion_matrix = [count_true_pos, count_false_pos, count_false_neg, count_true_neg]
    no_links_found = np.count_nonzero(result)
    no_false = count_false_pos + count_false_neg
    Fscore = 2*precision*sensitivity/(precision+sensitivity)
    metrics_result = {'no_false':no_false, 'confusion_matrix':confusion_matrix ,'precision':precision,
                     'sensitivity':sensitivity ,'no_links':no_links_found, 'F-score': Fscore}
    return metrics_result

def blocking_performance(candidates, true_links, df):
    count = 0
    for candi in candidates:
        if df.loc[candi[0]]["match_id"]==df.loc[candi[1]]["match_id"]:
            count = count + 1
    return count

CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs
Wall time: 6.2 µs
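To make the arithmetic in `evaluation()` concrete, the following sketch recomputes the same quantities on a small, made-up set of labels using plain Python. The toy labels and the helper name `toy_evaluation` are assumptions for illustration only; the notebook itself uses the NumPy version above.

```python
def toy_evaluation(labels, predictions):
    """Recompute the metrics of evaluation() with plain Python lists."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # sensitivity = recall
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "no_false": fp + fn,
        "confusion_matrix": [tp, fp, fn, tn],
        "precision": precision,
        "sensitivity": sensitivity,
        "no_links": sum(predictions),
        "F-score": f_score,
    }


# Made-up example: 4 true links, 5 predicted links.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_predictions = [1, 1, 0, 1, 1, 0, 0, 1]
toy_metrics = toy_evaluation(toy_labels, toy_predictions)
print(toy_metrics)
```

Like `evaluation()`, this sketch does not guard against a zero denominator, which would occur if a classifier predicted no links at all.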


# 3.0 Running the Experiment 10 Times

In [3]:
%%time
FEBRL_surname_nc = []
FEBRL_surname_pc = []
FEBRL_surname_rr = []
FEBRL_given_name_nc = []
FEBRL_given_name_pc = []
FEBRL_given_name_rr = []
FEBRL_postcode_nc = []
FEBRL_postcode_pc = []
FEBRL_postcode_rr = []
FEBRL_all_nc = []
FEBRL_all_pc = []
FEBRL_all_rr = []

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.96 µs


In [4]:
%%time
FEBRL_svm_pr = []
FEBRL_svm_re = []
FEBRL_svm_fs = []
FEBRL_svm_fc = []
FEBRL_svm_bag_pr = []
FEBRL_svm_bag_re = []
FEBRL_svm_bag_fs = []
FEBRL_svm_bag_fc = []
FEBRL_nn_pr = []
FEBRL_nn_re = []
FEBRL_nn_fs = []
FEBRL_nn_fc = []
FEBRL_nn_bag_pr = []
FEBRL_nn_bag_re = []
FEBRL_nn_bag_fs = []
FEBRL_nn_bag_fc = []
FEBRL_lr_pr = []
FEBRL_lr_re = []
FEBRL_lr_fs = []
FEBRL_lr_fc = []
FEBRL_lr_bag_pr = []
FEBRL_lr_bag_re = []
FEBRL_lr_bag_fs = []
FEBRL_lr_bag_fc = []
FEBRL_ensemble_pr = []
FEBRL_ensemble_re = []
FEBRL_ensemble_fs = []
FEBRL_ensemble_fc = []

CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 9.06 µs


In [5]:
%%time
for i in range(10):
    print("")
    print("ITERATION: ", i)
    print("")

    trainset = 'febrl4_UNSW_provided_by_authors'
    testset = 'febrl4_UNSW_provided_by_authors'
    
    # 1. Preparing the FEBRL dataset #################################################################################
    print("Preparing the FEBRL dataset")
    '''
    Source: 
    K. Vo, J. Jonnagaddala and S.-T. Liaw, "Medical-Record-Linkage-Ensemble," 16 February 2019. [Online]. 
    Available: https://github.com/ePBRN/Medical-Record-Linkage-Ensemble/.
    '''
    # Import
    print("Import train set...")
    df_train = pd.read_csv(trainset+".csv", index_col = "rec_id")
    train_true_links = generate_true_links(df_train)
    print("Train set size:", len(df_train), ", number of matched pairs: ", str(len(train_true_links)))

    # Preprocess train set
    df_train['postcode'] = df_train['postcode'].astype(str)
    df_train['given_name_soundex'] = phonetic(df_train['given_name'], method='soundex')
    df_train['given_name_nysiis'] = phonetic(df_train['given_name'], method='nysiis')
    df_train['surname_soundex'] = phonetic(df_train['surname'], method='soundex')
    df_train['surname_nysiis'] = phonetic(df_train['surname'], method='nysiis')

    # Final train feature vectors and labels
    X_train, y_train = generate_train_X_y(df_train)
    print("Finished building X_train, y_train")
    
    # 2. FEBRL Blocking Results ######################################################################################
    print("FEBRL Blocking Results")
    '''
    Source: 
    K. Vo, J. Jonnagaddala and S.-T. Liaw, "Medical-Record-Linkage-Ensemble," 16 February 2019. [Online]. 
    Available: https://github.com/ePBRN/Medical-Record-Linkage-Ensemble/.

    Code has been modified to reproduce and print Table 4 of the paper.
    '''
    # Blocking Criteria: declare non-match if all of the below fields disagree
    # Import
    print("Import test set...")
    FEBRL_blocking_results = []
    df_test = pd.read_csv(testset+".csv", index_col = "rec_id")
    test_true_links = generate_true_links(df_test)
    leng_test_true_links = len(test_true_links)
    print("Test set size:", len(df_test), ", number of matched pairs: ", str(leng_test_true_links))

    total_possible_pairs = comb(len(df_test),2)
    match_pairs = leng_test_true_links

    print("BLOCKING PERFORMANCE:")
    blocking_fields = ["given_name", "surname", "postcode"]
    all_candidate_pairs = []
    for field in blocking_fields:
        block_indexer = rl.BlockIndex(on=field)
        candidates = block_indexer.index(df_test)
        detects = blocking_performance(candidates, test_true_links, df_test)
        all_candidate_pairs = candidates.union(all_candidate_pairs)
        print("Number of pairs of matched "+ field +": "+str(len(candidates)), ", detected ",
             detects,'/'+ str(leng_test_true_links) + " true matched pairs, missed " + 
              str(leng_test_true_links-detects) )
        
        # recording results for iteration
        if field == 'given_name':
            FEBRL_given_name_nc.append(len(candidates))
            FEBRL_given_name_pc.append(detects/match_pairs*100.0)
            FEBRL_given_name_rr.append((1-(len(candidates)/1.0/total_possible_pairs))*100)
        if field == 'surname':
            FEBRL_surname_nc.append(len(candidates))
            FEBRL_surname_pc.append(detects/match_pairs*100.0)
            FEBRL_surname_rr.append((1-(len(candidates)/1.0/total_possible_pairs))*100)
        if field == 'postcode':
            FEBRL_postcode_nc.append(len(candidates))
            FEBRL_postcode_pc.append(detects/match_pairs*100.0)
            FEBRL_postcode_rr.append((1-(len(candidates)/1.0/total_possible_pairs))*100)  

    detects = blocking_performance(all_candidate_pairs, test_true_links, df_test)
    print("Number of pairs of at least 1 field matched: " + str(len(all_candidate_pairs)), ", detected ",
         detects,'/'+ str(leng_test_true_links) + " true matched pairs, missed " + 
              str(leng_test_true_links-detects) )
    
    # recording results for iteration
    FEBRL_all_nc.append(len(all_candidate_pairs))
    FEBRL_all_pc.append(detects/match_pairs*100.0)
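    # The reduction ratio for the union uses all_candidate_pairs, not the last field's candidates
    FEBRL_all_rr.append((1-(len(all_candidate_pairs)/1.0/total_possible_pairs))*100)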
    
    # 3. FEBRL Classification Performance Results ####################################################################
    print("FEBRL Classification Performance Results")
    '''
    Source: 
    K. Vo, J. Jonnagaddala and S.-T. Liaw, "Medical-Record-Linkage-Ensemble," 16 February 2019. [Online]. 
    Available: https://github.com/ePBRN/Medical-Record-Linkage-Ensemble/.
    '''
    ## TEST SET CONSTRUCTION
    # Preprocess test set
    print("Processing test set...")
    print("Preprocess...")
    df_test['postcode'] = df_test['postcode'].astype(str)
    df_test['given_name_soundex'] = phonetic(df_test['given_name'], method='soundex')
    df_test['given_name_nysiis'] = phonetic(df_test['given_name'], method='nysiis')
    df_test['surname_soundex'] = phonetic(df_test['surname'], method='soundex')
    df_test['surname_nysiis'] = phonetic(df_test['surname'], method='nysiis')

    # Test feature vectors and labels construction
    print("Extract feature vectors...")
    df_X_test = extract_features(df_test, all_candidate_pairs)
    vectors = df_X_test.values.tolist()
    labels = [0]*len(vectors)
    feature_index = df_X_test.index
    for idx in range(0, len(feature_index)):  # idx avoids shadowing the outer run counter i
        if df_test.loc[feature_index[idx][0]]["match_id"]==df_test.loc[feature_index[idx][1]]["match_id"]:
            labels[idx] = 1
    X_test, y_test = shuffle(vectors, labels, random_state=0)
    X_test = np.array(X_test)
    y_test = np.array(y_test)
    print("Count labels of y_test:",collections.Counter(y_test))
    print("Finished building X_test, y_test")

    '''
    Modifying the code provided by the authors to produce the results in Table 6 of the paper. 
    Used the hyperparameters as specified by Table 5 of the paper to build the models.

    Source: 
    K. Vo, J. Jonnagaddala and S.-T. Liaw, "Medical-Record-Linkage-Ensemble," 16 February 2019. [Online]. 
    Available: https://github.com/ePBRN/Medical-Record-Linkage-Ensemble/.
    '''
    # 3.1 SVM BASE LEARNERS CLASSIFICATION AND EVALUATION ############################################################
    '''
    Table 5 Hyperparameters for SVM on the FEBRL dataset
    1. Linear kernel
    2. C = 0.005
    '''
    modeltype = 'svm' # choose between 'svm', 'lg', 'nn'
    modeltype_2 = 'linear'  # 'linear' or 'rbf' for svm, 'l1' or 'l2' for lg, 'relu' or 'logistic' for nn
    modelparam = 0.005

    md = train_model(modeltype, modelparam, X_train, y_train, modeltype_2)
    final_result = classify(md, X_test)
    final_eval = evaluation(y_test, final_result)
    precision = final_eval['precision']
    sensitivity = final_eval['sensitivity']
    Fscore = final_eval['F-score']
    nb_false  = final_eval['no_false']
    
    FEBRL_svm_pr.append(precision)
    FEBRL_svm_re.append(sensitivity)
    FEBRL_svm_fs.append(Fscore)
    FEBRL_svm_fc.append(nb_false)

    # 3.2 NN BASE LEARNERS CLASSIFICATION AND EVALUATION #############################################################
    '''
    Table 5 Hyperparameters for NN on the FEBRL dataset
    1. ReLU activation with alpha = 100
    '''
    modeltype = 'nn' # choose between 'svm', 'lg', 'nn'
    modeltype_2 = 'relu'  # 'linear' or 'rbf' for svm, 'l1' or 'l2' for lg, 'relu' or 'logistic' for nn
    modelparam = 100

    md = train_model(modeltype, modelparam, X_train, y_train, modeltype_2)
    final_result = classify(md, X_test)
    final_eval = evaluation(y_test, final_result)
    precision = final_eval['precision']
    sensitivity = final_eval['sensitivity']
    Fscore = final_eval['F-score']
    nb_false = final_eval['no_false']
    
    FEBRL_nn_pr.append(precision)
    FEBRL_nn_re.append(sensitivity)
    FEBRL_nn_fs.append(Fscore)
    FEBRL_nn_fc.append(nb_false)

    # 3.3 LR BASE LEARNERS CLASSIFICATION AND EVALUATION #############################################################
    '''
    Table 5 Hyperparameters for LR on the FEBRL dataset
    1. Regularization l2
    2. C = 0.2
    '''
    modeltype = 'lg' # choose between 'svm', 'lg', 'nn'
    modeltype_2 = 'l2'  # 'linear' or 'rbf' for svm, 'l1' or 'l2' for lg, 'relu' or 'logistic' for nn
    modelparam = 0.2

    md = train_model(modeltype, modelparam, X_train, y_train, modeltype_2)
    final_result = classify(md, X_test)
    final_eval = evaluation(y_test, final_result)
    precision = final_eval['precision']
    sensitivity = final_eval['sensitivity']
    Fscore = final_eval['F-score']
    nb_false = final_eval['no_false']
    
    FEBRL_lr_pr.append(precision)
    FEBRL_lr_re.append(sensitivity)
    FEBRL_lr_fs.append(Fscore)
    FEBRL_lr_fc.append(nb_false)
    
    # 3.4 BAGGING BASE LEARNERS CLASSIFICATION AND EVALUATION ########################################################
    modeltypes = ['svm', 'nn', 'lg'] 
    modeltypes_2 = ['linear', 'relu', 'l2']
    modelparams = [0.005, 100, 0.2]
    nFold = 10
    kf = KFold(n_splits=nFold)
    model_raw_score = [0]*3
    model_binary_score = [0]*3
    model_i = 0
    for model_i in range(3):
        modeltype = modeltypes[model_i]
        modeltype_2 = modeltypes_2[model_i]
        modelparam = modelparams[model_i]
        # print(modeltype, "per fold:")
        iFold = 0
        result_fold = [0]*nFold
        final_eval_fold = [0]*nFold
        for train_index, valid_index in kf.split(X_train):
            X_train_fold = X_train[train_index]
            y_train_fold = y_train[train_index]
            md =  train_model(modeltype, modelparam, X_train_fold, y_train_fold, modeltype_2)
            result_fold[iFold] = classify(md, X_test)
            final_eval_fold[iFold] = evaluation(y_test, result_fold[iFold])
            # print("Fold", str(iFold), final_eval_fold[iFold])
            iFold = iFold + 1
        bagging_raw_score = np.average(result_fold, axis=0)
        bagging_binary_score  = np.copy(bagging_raw_score)
        bagging_binary_score[bagging_binary_score > 0.5] = 1
        bagging_binary_score[bagging_binary_score <= 0.5] = 0
        bagging_eval = evaluation(y_test, bagging_binary_score)
        # print(modeltype, "bagging:", bagging_eval)
        # print('')

        if modeltype == 'svm':
            FEBRL_svm_bag_pr.append(bagging_eval['precision'])
            FEBRL_svm_bag_re.append(bagging_eval['sensitivity'])
            FEBRL_svm_bag_fs.append(bagging_eval['F-score'])
            FEBRL_svm_bag_fc.append(bagging_eval['no_false'])
        elif modeltype == 'nn':
            FEBRL_nn_bag_pr.append(bagging_eval['precision'])
            FEBRL_nn_bag_re.append(bagging_eval['sensitivity'])
            FEBRL_nn_bag_fs.append(bagging_eval['F-score'])
            FEBRL_nn_bag_fc.append(bagging_eval['no_false'])   
        elif modeltype == 'lg':
            FEBRL_lr_bag_pr.append(bagging_eval['precision'])
            FEBRL_lr_bag_re.append(bagging_eval['sensitivity'])
            FEBRL_lr_bag_fs.append(bagging_eval['F-score'])
            FEBRL_lr_bag_fc.append(bagging_eval['no_false'])

        model_raw_score[model_i] = bagging_raw_score
        model_binary_score[model_i] = bagging_binary_score
        
    # 4 Ensemble Model Performance ###################################################################################
    '''
    Source: 
    K. Vo, J. Jonnagaddala and S.-T. Liaw, "Medical-Record-Linkage-Ensemble," 16 February 2019. [Online]. 
    Available: https://github.com/ePBRN/Medical-Record-Linkage-Ensemble/.
    '''
    thres = .99

    stack_raw_score = np.average(model_raw_score, axis=0)
    stack_binary_score = np.copy(stack_raw_score)
    stack_binary_score[stack_binary_score > thres] = 1
    stack_binary_score[stack_binary_score <= thres] = 0
    stacking_eval = evaluation(y_test, stack_binary_score)
    
    FEBRL_ensemble_pr.append(stacking_eval['precision'])
    FEBRL_ensemble_re.append(stacking_eval['sensitivity'])
    FEBRL_ensemble_fs.append(stacking_eval['F-score'])
    FEBRL_ensemble_fc.append(stacking_eval['no_false'])
    


ITERATION:  0

Preparing the FEBRL dataset
Import train set...
Train set size: 10000 , number of matched pairs:  5000
Finished building X_train, y_train
FEBRL Blocking Results
Import test set...
Test set size: 10000 , number of matched pairs:  5000
BLOCKING PERFORMANCE:
Number of pairs of matched given_name: 154898 , detected  3287 /5000 true matched pairs, missed 1713
Number of pairs of matched surname: 170843 , detected  3325 /5000 true matched pairs, missed 1675
Number of pairs of matched postcode: 53197 , detected  4219 /5000 true matched pairs, missed 781
Number of pairs of at least 1 field matched: 372073 , detected  4894 /5000 true matched pairs, missed 106
FEBRL Classification Performance Results
Processing test set...
Preprocess...
Extract feature vectors...
Count labels of y_test: Counter({0: 367179, 1: 4894})
Finished building X_test, y_test

ITERATION:  1

Preparing the FEBRL dataset
Import train set...
Train set size: 10000 , number of matched pairs:  5000
Finished buildi

Number of pairs of matched postcode: 53197 , detected  4219 /5000 true matched pairs, missed 781
Number of pairs of at least 1 field matched: 372073 , detected  4894 /5000 true matched pairs, missed 106
FEBRL Classification Performance Results
Processing test set...
Preprocess...
Extract feature vectors...
Count labels of y_test: Counter({0: 367179, 1: 4894})
Finished building X_test, y_test
CPU times: user 1h 6min 10s, sys: 1min 52s, total: 1h 8min 2s
Wall time: 51min 42s
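The bagging and stacking steps in the loop reduce to two rounds of averaging and thresholding. The sketch below replays them on made-up votes for three candidate pairs using plain Python lists; the helper names and toy votes are assumptions for illustration, while the 0.5 and 0.99 thresholds come from the code above.

```python
def average_columns(rows):
    """Column-wise mean of equally long lists (one row per fold or per model)."""
    return [sum(column) / len(column) for column in zip(*rows)]


def binarize(scores, threshold):
    """Map each score to 1 if it is strictly above the threshold, else 0."""
    return [1 if score > threshold else 0 for score in scores]


# Made-up 0/1 predictions for 3 candidate pairs from 10 folds per model.
svm_folds = [[1, 1, 0]] * 9 + [[1, 0, 0]]      # pair 1 gets 9 of 10 votes
nn_folds = [[1, 1, 0]] * 10                    # unanimous on every pair
lr_folds = [[1, 1, 1]] * 6 + [[1, 1, 0]] * 4   # pair 2 gets 6 of 10 votes

# Bagging: average the fold predictions of each model, then cut at 0.5.
raw_scores = [average_columns(folds) for folds in (svm_folds, nn_folds, lr_folds)]
bagged = [binarize(scores, 0.5) for scores in raw_scores]

# Stacking: average the three raw bagging scores, then cut at 0.99.
stack_raw = average_columns(raw_scores)
stacked = binarize(stack_raw, 0.99)

print(bagged)
print(stacked)
```

With 10 folds per model, each bagged score is a multiple of 0.1, so the mean of the three scores exceeds 0.99 only when every fold of every model predicts a match; the next highest possible mean is (1 + 1 + 0.9)/3, roughly 0.967.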


# 4.0 Results: Creating the Paper’s Table 4

## 4.1 Mean of blocking performance after 10 runs

In [6]:
%%time
results = []
results.append(sum(FEBRL_surname_nc) / float(len(FEBRL_surname_nc)))
results.append(sum(FEBRL_surname_pc) / float(len(FEBRL_surname_pc)))
results.append(sum(FEBRL_surname_rr) / float(len(FEBRL_surname_rr)))
results.append(sum(FEBRL_given_name_nc) / float(len(FEBRL_given_name_nc)))
results.append(sum(FEBRL_given_name_pc) / float(len(FEBRL_given_name_pc)))
results.append(sum(FEBRL_given_name_rr) / float(len(FEBRL_given_name_rr)))
results.append(sum(FEBRL_postcode_nc) / float(len(FEBRL_postcode_nc)))
results.append(sum(FEBRL_postcode_pc) / float(len(FEBRL_postcode_pc)))
results.append(sum(FEBRL_postcode_rr) / float(len(FEBRL_postcode_rr)))
results.append(sum(FEBRL_all_nc) / float(len(FEBRL_all_nc)))
results.append(sum(FEBRL_all_pc) / float(len(FEBRL_all_pc)))
results.append(sum(FEBRL_all_rr) / float(len(FEBRL_all_rr)))

blocking_criterion = ['Surname', 'Surname', 'Surname', 
                      'Given name', 'Given name', 'Given name',
                      'Postcode', 'Postcode', 'Postcode',
                      'All', 'All', 'All']
measure = ['nc', 'pc', 'rr',
           'nc', 'pc', 'rr',
           'nc', 'pc', 'rr',
           'nc', 'pc', 'rr']

CPU times: user 20 µs, sys: 4 µs, total: 24 µs
Wall time: 25 µs


In [7]:
%%time
blocking_results = pd.DataFrame(blocking_criterion, columns=['Blocking Criterion'])
blocking_results['Measure'] = measure
blocking_results['FEBRL Results (Mean of 10 Runs)'] = results

CPU times: user 40.5 ms, sys: 3.24 ms, total: 43.7 ms
Wall time: 5.61 ms


In [8]:
blocking_results

| | Blocking Criterion | Measure | FEBRL Results (Mean of 10 Runs) |
|---|---|---|---|
| 0 | Surname | nc | 170843.0 |
| 1 | Surname | pc | 66.5 |
| 2 | Surname | rr | 99.65828 |
| 3 | Given name | nc | 154898.0 |
| 4 | Given name | pc | 65.74 |
| 5 | Given name | rr | 99.690173 |
| 6 | Postcode | nc | 53197.0 |
| 7 | Postcode | pc | 84.38 |
| 8 | Postcode | rr | 99.893595 |
| 9 | All | nc | 372073.0 |
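The `pc` and `rr` rows can be checked by hand from the iteration log. The sketch below assumes `pc` denotes pairs completeness (detected true pairs over all true pairs) and `rr` the reduction ratio (share of all possible pairs removed by blocking), matching the expressions in the experiment loop; the helper names are illustrative.

```python
from math import comb


def pairs_completeness(detected_true_pairs, total_true_pairs):
    """Percentage of true matched pairs that survive blocking."""
    return detected_true_pairs / total_true_pairs * 100.0


def reduction_ratio(candidate_pairs, n_records):
    """Percentage of all possible record pairs removed by blocking."""
    total_possible_pairs = comb(n_records, 2)
    return (1 - candidate_pairs / total_possible_pairs) * 100.0


# Figures taken from the iteration log: 10000 records, 5000 true pairs.
surname_pc = pairs_completeness(3325, 5000)
surname_rr = reduction_ratio(170843, 10000)
postcode_pc = pairs_completeness(4219, 5000)
postcode_rr = reduction_ratio(53197, 10000)

print("surname:", surname_pc, round(surname_rr, 5))
print("postcode:", postcode_pc, round(postcode_rr, 6))
```

With 10000 records there are 49,995,000 possible pairs, which is why even the largest candidate set still yields a reduction ratio above 99%.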


## 4.2 STD of blocking performance after 10 runs

In [9]:
%%time
print("FEBRL_surname_nc STD: ", statistics.pstdev(FEBRL_surname_nc)) 
print("FEBRL_surname_pc STD: ", statistics.pstdev(FEBRL_surname_pc)) 
print("FEBRL_surname_rr STD: ", statistics.pstdev(FEBRL_surname_rr)) 
print("FEBRL_given_name_nc STD: ", statistics.pstdev(FEBRL_given_name_nc)) 
print("FEBRL_given_name_pc STD: ", statistics.pstdev(FEBRL_given_name_pc)) 
print("FEBRL_given_name_rr STD: ", statistics.pstdev(FEBRL_given_name_rr)) 
print("FEBRL_postcode_nc STD: ", statistics.pstdev(FEBRL_postcode_nc)) 
print("FEBRL_postcode_pc STD: ", statistics.pstdev(FEBRL_postcode_pc)) 
print("FEBRL_postcode_rr STD: ", statistics.pstdev(FEBRL_postcode_rr)) 
print("FEBRL_all_nc STD: ", statistics.pstdev(FEBRL_all_nc)) 
print("FEBRL_all_pc STD: ", statistics.pstdev(FEBRL_all_pc))
print("FEBRL_all_rr STD: ", statistics.pstdev(FEBRL_all_rr)) 

FEBRL_surname_nc STD:  0.0
FEBRL_surname_pc STD:  0.0
FEBRL_surname_rr STD:  0.0
FEBRL_given_name_nc STD:  0.0
FEBRL_given_name_pc STD:  0.0
FEBRL_given_name_rr STD:  0.0
FEBRL_postcode_nc STD:  0.0
FEBRL_postcode_pc STD:  0.0
FEBRL_postcode_rr STD:  0.0
FEBRL_all_nc STD:  0.0
FEBRL_all_pc STD:  0.0
FEBRL_all_rr STD:  0.0
CPU times: user 13 ms, sys: 621 µs, total: 13.7 ms
Wall time: 1.69 ms
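A standard deviation of 0.0 is what `statistics.pstdev` returns when all 10 values are identical, which is consistent with blocking being deterministic on a fixed input file (an interpretation, not something the notebook states). The sketch below also contrasts the population standard deviation used here with the sample standard deviation `statistics.stdev`; the five `fc`-like values are made up for illustration.

```python
import statistics

# Ten identical runs, as in the blocking results: the spread is exactly zero.
identical_runs = [170843] * 10
print(statistics.pstdev(identical_runs))

# Made-up values that do vary between runs.
toy_values = [580, 590, 570, 600, 560]
population_sd = statistics.pstdev(toy_values)  # divides the squared deviations by n
sample_sd = statistics.stdev(toy_values)       # divides the squared deviations by n - 1

print(round(population_sd, 4), round(sample_sd, 4))
```

The sample version is always at least as large as the population version, and the gap shrinks as the number of runs grows.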


# 5.0 Results: Creating the Paper’s Table 6

## 5.1 Mean of classification performance after 10 runs

In [10]:
%%time
pr_col_MEAN = []
pr_col_MEAN.append(sum(FEBRL_svm_pr) / float(len(FEBRL_svm_pr)))
pr_col_MEAN.append(sum(FEBRL_svm_bag_pr) / float(len(FEBRL_svm_bag_pr)))
pr_col_MEAN.append(sum(FEBRL_nn_pr) / float(len(FEBRL_nn_pr)))
pr_col_MEAN.append(sum(FEBRL_nn_bag_pr) / float(len(FEBRL_nn_bag_pr)))
pr_col_MEAN.append(sum(FEBRL_lr_pr) / float(len(FEBRL_lr_pr)))
pr_col_MEAN.append(sum(FEBRL_lr_bag_pr) / float(len(FEBRL_lr_bag_pr)))
pr_col_MEAN.append(sum(FEBRL_ensemble_pr) / float(len(FEBRL_ensemble_pr)))

re_col_MEAN = []
re_col_MEAN.append(sum(FEBRL_svm_re) / float(len(FEBRL_svm_re)))
re_col_MEAN.append(sum(FEBRL_svm_bag_re) / float(len(FEBRL_svm_bag_re)))
re_col_MEAN.append(sum(FEBRL_nn_re) / float(len(FEBRL_nn_re)))
re_col_MEAN.append(sum(FEBRL_nn_bag_re) / float(len(FEBRL_nn_bag_re)))
re_col_MEAN.append(sum(FEBRL_lr_re) / float(len(FEBRL_lr_re)))
re_col_MEAN.append(sum(FEBRL_lr_bag_re) / float(len(FEBRL_lr_bag_re)))
re_col_MEAN.append(sum(FEBRL_ensemble_re) / float(len(FEBRL_ensemble_re)))

fs_col_MEAN = []
fs_col_MEAN.append(sum(FEBRL_svm_fs) / float(len(FEBRL_svm_fs)))
fs_col_MEAN.append(sum(FEBRL_svm_bag_fs) / float(len(FEBRL_svm_bag_fs)))
fs_col_MEAN.append(sum(FEBRL_nn_fs) / float(len(FEBRL_nn_fs)))
fs_col_MEAN.append(sum(FEBRL_nn_bag_fs) / float(len(FEBRL_nn_bag_fs)))
fs_col_MEAN.append(sum(FEBRL_lr_fs) / float(len(FEBRL_lr_fs)))
fs_col_MEAN.append(sum(FEBRL_lr_bag_fs) / float(len(FEBRL_lr_bag_fs)))
fs_col_MEAN.append(sum(FEBRL_ensemble_fs) / float(len(FEBRL_ensemble_fs)))

fc_col_MEAN = []
fc_col_MEAN.append(sum(FEBRL_svm_fc) / float(len(FEBRL_svm_fc)))
fc_col_MEAN.append(sum(FEBRL_svm_bag_fc) / float(len(FEBRL_svm_bag_fc)))
fc_col_MEAN.append(sum(FEBRL_nn_fc) / float(len(FEBRL_nn_fc)))
fc_col_MEAN.append(sum(FEBRL_nn_bag_fc) / float(len(FEBRL_nn_bag_fc)))
fc_col_MEAN.append(sum(FEBRL_lr_fc) / float(len(FEBRL_lr_fc)))
fc_col_MEAN.append(sum(FEBRL_lr_bag_fc) / float(len(FEBRL_lr_bag_fc)))
fc_col_MEAN.append(sum(FEBRL_ensemble_fc) / float(len(FEBRL_ensemble_fc)))

CPU times: user 849 µs, sys: 5 µs, total: 854 µs
Wall time: 111 µs


In [11]:
%%time
models = ['SVM', 'SVM-bag', 'NN', 'NN-bag', 'LR', 'LR-bag', 'Stack+Bag']
df_means = pd.DataFrame(models, columns=['Model'])
df_means['pr(%)'] = pr_col_MEAN
df_means['pr(%)'] = df_means['pr(%)']*100
df_means['re(%)'] = re_col_MEAN
df_means['re(%)'] = df_means['re(%)']*100
df_means['fs(%)'] = fs_col_MEAN
df_means['fs(%)'] = df_means['fs(%)']*100
df_means['fc'] = fc_col_MEAN

CPU times: user 8.48 ms, sys: 234 µs, total: 8.72 ms
Wall time: 2.66 ms


In [12]:
df_means

| | Model | pr(%) | re(%) | fs(%) | fc |
|---|---|---|---|---|---|
| 0 | SVM | 89.54254 | 99.775235 | 94.382234 | 581.3 |
| 1 | SVM-bag | 89.534456 | 99.775235 | 94.37771 | 581.8 |
| 2 | NN | 93.5342 | 99.754802 | 96.544316 | 349.5 |
| 3 | NN-bag | 93.737043 | 99.754802 | 96.652281 | 338.2 |
| 4 | LR | 89.28374 | 99.765018 | 94.233058 | 597.7 |
| 5 | LR-bag | 89.932994 | 99.762975 | 94.592554 | 558.3 |
| 6 | Stack+Bag | 94.602185 | 99.754802 | 97.109992 | 290.6 |


## 5.2 STD of classification performance after 10 runs

In [13]:
%%time
pr_col_STD = []
pr_col_STD.append(statistics.pstdev(FEBRL_svm_pr))
pr_col_STD.append(statistics.pstdev(FEBRL_svm_bag_pr))
pr_col_STD.append(statistics.pstdev(FEBRL_nn_pr))
pr_col_STD.append(statistics.pstdev(FEBRL_nn_bag_pr))
pr_col_STD.append(statistics.pstdev(FEBRL_lr_pr))
pr_col_STD.append(statistics.pstdev(FEBRL_lr_bag_pr))
pr_col_STD.append(statistics.pstdev(FEBRL_ensemble_pr))

re_col_STD = []
re_col_STD.append(statistics.pstdev(FEBRL_svm_re))
re_col_STD.append(statistics.pstdev(FEBRL_svm_bag_re))
re_col_STD.append(statistics.pstdev(FEBRL_nn_re))
re_col_STD.append(statistics.pstdev(FEBRL_nn_bag_re))
re_col_STD.append(statistics.pstdev(FEBRL_lr_re))
re_col_STD.append(statistics.pstdev(FEBRL_lr_bag_re))
re_col_STD.append(statistics.pstdev(FEBRL_ensemble_re))

fs_col_STD = []
fs_col_STD.append(statistics.pstdev(FEBRL_svm_fs))
fs_col_STD.append(statistics.pstdev(FEBRL_svm_bag_fs))
fs_col_STD.append(statistics.pstdev(FEBRL_nn_fs))
fs_col_STD.append(statistics.pstdev(FEBRL_nn_bag_fs))
fs_col_STD.append(statistics.pstdev(FEBRL_lr_fs))
fs_col_STD.append(statistics.pstdev(FEBRL_lr_bag_fs))
fs_col_STD.append(statistics.pstdev(FEBRL_ensemble_fs))

fc_col_STD = []
fc_col_STD.append(statistics.pstdev(FEBRL_svm_fc))
fc_col_STD.append(statistics.pstdev(FEBRL_svm_bag_fc))
fc_col_STD.append(statistics.pstdev(FEBRL_nn_fc))
fc_col_STD.append(statistics.pstdev(FEBRL_nn_bag_fc))
fc_col_STD.append(statistics.pstdev(FEBRL_lr_fc))
fc_col_STD.append(statistics.pstdev(FEBRL_lr_bag_fc))
fc_col_STD.append(statistics.pstdev(FEBRL_ensemble_fc))

CPU times: user 39 ms, sys: 307 µs, total: 39.3 ms
Wall time: 4.93 ms


In [14]:
%%time
df_STD = pd.DataFrame(models, columns=['Model'])
df_STD['pr(%)'] = pr_col_STD
df_STD['pr(%)'] = df_STD['pr(%)']*100
df_STD['re(%)'] = re_col_STD
df_STD['re(%)'] = df_STD['re(%)']*100
df_STD['fs(%)'] = fs_col_STD
df_STD['fs(%)'] = df_STD['fs(%)']*100
df_STD['fc'] = fc_col_STD

CPU times: user 23.2 ms, sys: 1.2 ms, total: 24.4 ms
Wall time: 3.23 ms


In [15]:
df_STD

Unnamed: 0,Model,pr(%),re(%),fs(%),fc
0,SVM,0.19588,0.0,0.108686,11.874342
1,SVM-bag,0.222846,0.0,0.123597,13.490738
2,NN,0.177631,0.0,0.094576,9.899495
3,NN-bag,0.150931,0.0,0.080242,8.3666
4,LR,0.50865,0.010217,0.283564,31.288976
5,LR-bag,0.499466,0.01001,0.275485,30.149627
6,Stack+Bag,0.271923,0.0,0.143299,14.832397
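
The STD values above come from `statistics.pstdev`, which is the population standard deviation. When the 10 runs are treated as a sample of possible runs, the sample standard deviation (`statistics.stdev`) is the usual alternative, and for n values it is larger by a factor of sqrt(n / (n - 1)), which is about 5.4% for n = 10. The sketch below illustrates the relationship; the `scores` list is a made-up placeholder, not data from the experiment.

```python
import math
import statistics

# Hypothetical scores from 10 runs; placeholder values for illustration only.
scores = [0.895, 0.897, 0.893, 0.896, 0.894, 0.898, 0.892, 0.895, 0.896, 0.894]

population_std = statistics.pstdev(scores)  # divides by n
sample_std = statistics.stdev(scores)       # divides by n - 1

# For n runs the two estimates differ by a factor of sqrt(n / (n - 1)).
ratio = sample_std / population_std
print(ratio, math.sqrt(len(scores) / (len(scores) - 1)))
```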


## 5.3 Checking whether the paper's classification results fall within two standard deviations of the reproduced mean after 10 runs

In [16]:
%%time
df_lower_and_upper = pd.DataFrame(models, columns=['Model'])
df_lower_and_upper['pr(%)_lower'] = df_means['pr(%)'] - 2 * df_STD['pr(%)']
df_lower_and_upper['pr(%)_uppper'] = df_means['pr(%)'] + 2 * df_STD['pr(%)']    
df_lower_and_upper['re(%)_lower'] = df_means['re(%)'] - 2 * df_STD['re(%)']
df_lower_and_upper['re(%)_uppper'] = df_means['re(%)'] + 2 * df_STD['re(%)'] 
df_lower_and_upper['fs(%)_lower'] = df_means['fs(%)'] - 2 * df_STD['fs(%)']
df_lower_and_upper['fs(%)_uppper'] = df_means['fs(%)'] + 2 * df_STD['fs(%)']
df_lower_and_upper['fc_lower'] = df_means['fc'] - 2 * df_STD['fc']
df_lower_and_upper['fc_uppper'] = df_means['fc'] + 2 * df_STD['fc']

CPU times: user 46.2 ms, sys: 1.14 ms, total: 47.3 ms
Wall time: 5.87 ms


In [17]:
df_lower_and_upper

Unnamed: 0,Model,pr(%)_lower,pr(%)_uppper,re(%)_lower,re(%)_uppper,fs(%)_lower,fs(%)_uppper,fc_lower,fc_uppper
0,SVM,89.150779,89.9343,99.775235,99.775235,94.164862,94.599607,557.551316,605.048684
1,SVM-bag,89.088763,89.980148,99.775235,99.775235,94.130517,94.624904,554.818525,608.781475
2,NN,93.178939,93.889462,99.754802,99.754802,96.355163,96.733468,329.70101,369.29899
3,NN-bag,93.435181,94.038905,99.754802,99.754802,96.491797,96.812765,321.466799,354.933201
4,LR,88.26644,90.301039,99.744585,99.785452,93.66593,94.800185,535.122049,660.277951
5,LR-bag,88.934063,90.931925,99.742955,99.782995,94.041583,95.143524,498.000746,618.599254
6,Stack+Bag,94.058339,95.146031,99.754802,99.754802,96.823395,97.396589,260.935206,320.264794
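
The lower and upper bounds are simply mean ± 2 × STD for each metric, so they can also be built in a loop over the metric names. The following is a minimal sketch under that reading, using only the SVM and NN rows copied from the mean and STD tables above. The names `means`, `stds` and `bands` are illustrative, and the sketch spells the column suffix `_upper`, whereas the notebook's columns use `_uppper`.

```python
import pandas as pd

# SVM and NN rows copied from the mean and STD tables above.
means = pd.DataFrame(
    {"Model": ["SVM", "NN"], "pr(%)": [89.54254, 93.5342], "fc": [581.3, 349.5]}
)
stds = pd.DataFrame(
    {"Model": ["SVM", "NN"], "pr(%)": [0.19588, 0.177631], "fc": [11.874342, 9.899495]}
)

# Build the mean +/- 2 * STD band for every metric in one loop.
bands = pd.DataFrame({"Model": means["Model"]})
for metric in ["pr(%)", "fc"]:
    bands[f"{metric}_lower"] = means[metric] - 2 * stds[metric]
    bands[f"{metric}_upper"] = means[metric] + 2 * stds[metric]
print(bands)
```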


In [18]:
%%time
df_authors_values = pd.DataFrame(models, columns=['Model'])
df_authors_values['pr(%)'] = [94.85, 95.46, 92.80, 92.75, 84.46, 84.27, 96.97]
df_authors_values['re(%)'] = [99.73, 99.73, 99.59, 99.57, 99.69, 99.69, 99.43]
df_authors_values['fs(%)'] = [97.23, 97.55, 96.08, 96.04, 91.44, 91.33, 98.18]
df_authors_values['fc'] = [278, 245, 398, 402, 913, 926, 180]

CPU times: user 2.16 ms, sys: 218 µs, total: 2.38 ms
Wall time: 2.21 ms


In [19]:
df_authors_values

Unnamed: 0,Model,pr(%),re(%),fs(%),fc
0,SVM,94.85,99.73,97.23,278
1,SVM-bag,95.46,99.73,97.55,245
2,NN,92.8,99.59,96.08,398
3,NN-bag,92.75,99.57,96.04,402
4,LR,84.46,99.69,91.44,913
5,LR-bag,84.27,99.69,91.33,926
6,Stack+Bag,96.97,99.43,98.18,180


In [20]:
%%time
within_2_std = pd.DataFrame(models, columns=['Model'])
within_2_std['pr'] = (df_authors_values['pr(%)'] >= df_lower_and_upper['pr(%)_lower']) & (df_authors_values['pr(%)'] <= df_lower_and_upper['pr(%)_uppper'])
within_2_std['re'] = (df_authors_values['re(%)'] >= df_lower_and_upper['re(%)_lower']) & (df_authors_values['re(%)'] <= df_lower_and_upper['re(%)_uppper'])
within_2_std['fs'] = (df_authors_values['fs(%)'] >= df_lower_and_upper['fs(%)_lower']) & (df_authors_values['fs(%)'] <= df_lower_and_upper['fs(%)_uppper'])
within_2_std['fc'] = (df_authors_values['fc'] >= df_lower_and_upper['fc_lower']) & (df_authors_values['fc'] <= df_lower_and_upper['fc_uppper'])


CPU times: user 3.59 ms, sys: 344 µs, total: 3.93 ms
Wall time: 3.69 ms


In [21]:
# True = the paper's result falls within 2 standard deviations of the reproduced mean
# False = the paper's result does not fall within 2 standard deviations of the reproduced mean
within_2_std

Unnamed: 0,Model,pr,re,fs,fc
0,SVM,False,False,False,False
1,SVM-bag,False,False,False,False
2,NN,False,False,False,False
3,NN-bag,False,False,False,False
4,LR,False,False,False,False
5,LR-bag,False,False,False,False
6,Stack+Bag,False,False,False,False
