## Exploring linkage models using real time linkage

In this notebook, I demonstrate the use of the `linker.compare_two_records` function to interactively explore the results of a linkage model.

### Step 1: Load a pre-trained linkage model

In [1]:
import pandas as pd
import json
from splink.duckdb.duckdb_linker import DuckDBLinker

with open("demo_settings/real_time_settings.json") as f:
    trained_settings = json.load(f)

df = pd.read_csv("./data/fake_1000.csv")

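linker = DuckDBLinker(trained_settings, input_tables={"input_df": df})

# Build the concatenated input table and the term frequency tables up front,
# so they are already available when records are compared below.
linker._initialise_df_concat_with_tf()
linker.compute_tf_table("first_name")
linker.compute_tf_table("surname")
linker.compute_tf_table("dob")
linker.compute_tf_table("city")
linker.compute_tf_table("email")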

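The `compute_tf_table` calls build one term frequency lookup per column. As a rough illustration only, assuming a term frequency is the share of non-null values in a column that equal a given value (Splink computes this in SQL and may differ in detail), the idea can be sketched in plain Python. The function `term_frequencies` and the sample names are hypothetical and not part of Splink.

```python
from collections import Counter


def term_frequencies(values):
    """Map each non-null value to its relative frequency (hypothetical helper)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {value: n / len(non_null) for value, n in counts.items()}


# Illustrative sample column; None stands for a missing value
first_names = ["George", "Robert", "Robert", None, "Rob"]
print(term_frequencies(first_names))
```

A rare value gets a small term frequency, which is what lets a model treat agreement on an unusual name as stronger evidence than agreement on a common one.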

### Step 2: Comparing two records

It is now possible to compute a match weight for any two records using `linker.compare_two_records()`.

In [2]:
record_1 = {
    'unique_id': 1,
    'first_name': "George",
    'surname': "Smith",
    'dob': "1957-02-17",
    'city': "London",
    'email': "john.smith@hotmail.com"
}

record_2 = {
    'unique_id': 2,
    'first_name': "George",
    'surname': "Smith",
    'dob': "1957-02-17",
    'city': "London",
    'email': "john.smith@hotmail.com"
}

df_two = linker.compare_two_records(record_1, record_2)
df_two.as_pandas_dataframe()

| | match_weight | match_probability | unique_id_l | unique_id_r | first_name_l | first_name_r | gamma_first_name | tf_first_name_l | tf_first_name_r | bf_first_name | ... | tf_city_r | bf_city | bf_tf_adj_city | email_l | email_r | gamma_email | tf_email_l | tf_email_r | bf_email | bf_tf_adj_email |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.155343 | 1.0 | 1 | 2 | George | George | 2 | 0.01444 | 0.01444 | 87.571229 | ... | 0.212792 | 10.484859 | 0.259162 | john.smith@hotmail.com | john.smith@hotmail.com | 1 | | | 263.229168 | 1.0 |
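
The `match_weight` and `match_probability` columns carry the same information on different scales. As a minimal sketch, assuming the match weight is the base-2 logarithm of the overall Bayes factor (the odds of a match), the conversion can be written as follows. The helper name `weight_to_probability` is illustrative and not part of Splink.

```python
def weight_to_probability(match_weight: float) -> float:
    """Convert a log2 match weight into a match probability (hypothetical helper)."""
    bayes_factor = 2.0 ** match_weight  # odds of a match
    return bayes_factor / (1.0 + bayes_factor)


# A weight of zero corresponds to even odds
print(weight_to_probability(0.0))
```

Under this assumption a weight of 2.427256 maps to roughly 0.8432, and a weight of 22.155343 is so close to 1 that it displays as 1.0, which is consistent with the outputs shown in this notebook.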


### Step 3: Interactive comparisons

One interesting application of `compare_two_records` is a simple interface that lets the user enter two records interactively and get real-time feedback.

In the following cell we use `ipywidgets` for this purpose.  ✨✨ Change the values in the text boxes to see the waterfall chart update in real time. ✨✨

In [3]:
import ipywidgets as widgets
fields = ["unique_id", "first_name", "surname", "dob", "email", "city"]

left_text_boxes = []
right_text_boxes = []

inputs_to_interactive_output = {}

for f in fields:
    wl = widgets.Text(description=f, value=str(record_1[f]))
    left_text_boxes.append(wl)
    inputs_to_interactive_output[f"{f}_l"] = wl
    wr = widgets.Text(description=f, value=str(record_2[f]))
    right_text_boxes.append(wr)
    inputs_to_interactive_output[f"{f}_r"] = wr


b1 = widgets.VBox(left_text_boxes)
b2 = widgets.VBox(right_text_boxes)
ui = widgets.HBox([b1,b2])

def myfn(**kwargs):
    # Split the widget values into a left and a right record,
    # using the "_l" / "_r" suffix of each key.
    record_left = {}
    record_right = {}

    for key, value in kwargs.items():
        # Treat an empty text box as a missing value
        if value == '':
            value = None
        if key.endswith("_l"):
            record_left[key[:-2]] = value
        elif key.endswith("_r"):
            record_right[key[:-2]] = value

    # Keep the intermediate columns used by the waterfall chart
    linker.settings_obj._retain_intermediate_calculation_columns = True
    linker.settings_obj._retain_matching_columns = True

    df_two = linker.compare_two_records(record_left, record_right)
    recs = df_two.as_pandas_dataframe().to_dict(orient="records")
    from splink.charts import waterfall_chart
    waterfall_chart(recs, linker.settings_obj, filter_nulls=False)


out = widgets.interactive_output(myfn, inputs_to_interactive_output)

display(ui,out)


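The waterfall chart can be read as a running sum. As a rough sketch, assuming each bar is the base-2 logarithm of one Bayes factor and that the bars add up to the final match weight, the snippet below decomposes a handful of Bayes factors. The values are the `bf_` figures visible in the Step 2 output; the elided columns and the prior are left out, so the total here is only a partial weight. The function name `partial_match_weights` is hypothetical.

```python
import math


def partial_match_weights(bayes_factors: dict) -> dict:
    """Return the log2 contribution of each Bayes factor (hypothetical helper)."""
    return {name: math.log2(bf) for name, bf in bayes_factors.items()}


# Bayes factors copied from the visible columns of the Step 2 output
bayes_factors = {
    "bf_first_name": 87.571229,
    "bf_city": 10.484859,
    "bf_tf_adj_city": 0.259162,
    "bf_email": 263.229168,
    "bf_tf_adj_email": 1.0,
}

weights = partial_match_weights(bayes_factors)
running_total = 0.0
for name, weight in weights.items():
    running_total += weight
    print(f"{name:>16}: {weight:+.3f}  (running total {running_total:+.3f})")
```

A Bayes factor above 1 pushes the total up, one below 1 (such as the term frequency adjustment for a common city) pulls it down, and a factor of exactly 1 contributes nothing.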

## Finding matching records interactively

It is also possible to rapidly search the records in the input dataset using the `linker.find_matches_to_new_records()` function.

In [4]:
record = {'unique_id': 123987,
 'first_name': "Robert",
 'surname': "Alan",
 'dob': "1971-05-24",
 'city': "London",
 'email': "robert255@smith.net"
}



df_inc = linker.find_matches_to_new_records([record], blocking_rules=[]).as_pandas_dataframe()
df_inc.sort_values("match_weight", ascending=False)

| | match_weight | match_probability | source_dataset_l | unique_id_l | source_dataset_r | unique_id_r | first_name_l | first_name_r | gamma_first_name | tf_first_name_l | ... | tf_city_r | bf_city | bf_tf_adj_city | email_l | email_r | gamma_email | tf_email_l | tf_email_r | bf_email | bf_tf_adj_email |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 23.531793 | 1.0 | input_df | 0 | __splink__df_new_records | 123987 | Robert | Robert | 2 | 0.00361 | ... | 0.212792 | 1.0 | 1.0 | robert255@smith.net | robert255@smith.net | 1 | 0.001267 | 0.001267 | 263.229168 | 1.730964 |
| 2 | 14.55032 | 0.999958 | input_df | 1 | __splink__df_new_records | 123987 | Robert | Robert | 2 | 0.00361 | ... | 0.212792 | 1.0 | 1.0 | roberta25@smith.net | robert255@smith.net | 0 | 0.002535 | 0.001267 | 0.423438 | 1.0 |
| 4 | 10.388623 | 0.999255 | input_df | 3 | __splink__df_new_records | 123987 | Robert | Robert | 2 | 0.00361 | ... | 0.212792 | 0.446404 | 1.0 | | robert255@smith.net | -1 | | 0.001267 | 1.0 | 1.0 |
| 0 | 2.427256 | 0.843228 | input_df | 2 | __splink__df_new_records | 123987 | Rob | Robert | 0 | 0.001203 | ... | 0.212792 | 10.484859 | 0.259162 | roberta25@smith.net | robert255@smith.net | 0 | 0.002535 | 0.001267 | 0.423438 | 1.0 |
| 5 | -2.12309 | 0.186697 | input_df | 8 | __splink__df_new_records | 123987 | | Robert | -1 | | ... | 0.212792 | 1.0 | 1.0 | | robert255@smith.net | -1 | | 0.001267 | 1.0 | 1.0 |
| 6 | -2.205894 | 0.178139 | input_df | 754 | __splink__df_new_records | 123987 | | Robert | -1 | | ... | 0.212792 | 1.0 | 1.0 | j.c@whige.wort | robert255@smith.net | 0 | 0.001267 | 0.001267 | 0.423438 | 1.0 |
| 3 | -2.802309 | 0.125383 | input_df | 750 | __splink__df_new_records | 123987 | | Robert | -1 | | ... | 0.212792 | 10.484859 | 0.259162 | j.c@white.org | robert255@smith.net | 0 | 0.002535 | 0.001267 | 0.423438 | 1.0 |
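
In practice the scored candidates are often narrowed down to those above a chosen probability threshold. The following is a minimal sketch, using plain dictionaries shaped like `df_inc.to_dict(orient="records")` with a few values copied from the output above. The threshold of 0.9 and the helper name `likely_matches` are assumptions made for illustration.

```python
def likely_matches(records, threshold=0.9):
    """Keep records at or above the threshold, best match first (hypothetical helper)."""
    kept = [r for r in records if r["match_probability"] >= threshold]
    return sorted(kept, key=lambda r: r["match_weight"], reverse=True)


# A subset of the scored candidates, reduced to three columns
records = [
    {"unique_id_l": 2, "match_weight": 2.427256, "match_probability": 0.843228},
    {"unique_id_l": 0, "match_weight": 23.531793, "match_probability": 1.0},
    {"unique_id_l": 8, "match_weight": -2.12309, "match_probability": 0.186697},
    {"unique_id_l": 1, "match_weight": 14.55032, "match_probability": 0.999958},
    {"unique_id_l": 3, "match_weight": 10.388623, "match_probability": 0.999255},
]

for r in likely_matches(records):
    print(r["unique_id_l"], r["match_probability"])
```

The right threshold depends on how costly false matches are relative to missed matches, so it is worth inspecting borderline rows before fixing a value.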


## Interactive interface for finding records

Again, we can use `ipywidgets` to build an interactive interface for the `linker.find_matches_to_new_records` function.

In [5]:
from splink.charts import waterfall_chart

@widgets.interact(first_name='Robert', surname="Alan", dob="1971-05-24", city="London", email="robert255@smith.net")
def interactive_link(first_name, surname, dob, city, email):    

    record = {'unique_id': 123987,
     'first_name': first_name,
     'surname': surname,
     'dob': dob,
     'city': city,
     'email': email,
     'group': 0}

    # Treat empty or whitespace-only inputs as missing values
    for key, value in record.items():
        if isinstance(value, str) and value.strip() == "":
            record[key] = None

    
    df_inc = linker.find_matches_to_new_records([record], blocking_rules=["(true)"]).as_pandas_dataframe()
    df_inc = df_inc.sort_values("match_weight", ascending=False)
    recs = df_inc.to_dict(orient="records")
    


    waterfall_chart(recs, linker.settings_obj, filter_nulls=False)

