# Does blocking affect the m and u estimates of other columns?

In this notebook I consider whether blocking on a column affects the estimated `m` and `u` values of the other columns.

It shows that blocking _does not_ affect the `m` and `u` estimates of the other columns, although it does change the estimated proportion of matches, λ.

In [1]:
from copy import deepcopy
settings = {
    "link_type": "dedupe_only",
    "comparison_columns": [
        {
            "col_name": "col_1",
            # Each list is [P(gamma = 0), P(gamma = 1)]
            "m_probabilities": [0.1, 0.9],  # Amongst matches: 0.1 is the typo rate
            "u_probabilities": [0.3, 0.7],  # Amongst non-matches: 0.7 is the collision rate
        },
        {
            "col_name": "col_2",
            "m_probabilities": [0.05, 0.95],  # Typo rate 0.05
            "u_probabilities": [0.975, 0.025],  # Collision rate 0.025
        },
        {
            "col_name": "col_3",
            "m_probabilities": [0.2, 0.8],  # Typo rate 0.2
            "u_probabilities": [0.8, 0.2],  # Collision rate 0.2
        },
        {
            "col_name": "col_4",
            "m_probabilities": [0.05, 0.95],  # Typo rate 0.05
            "u_probabilities": [0.9, 0.1],  # Collision rate 0.1
        },
    ],
    "additional_columns_to_retain": [
        "true_match", "true_match_probability"
    ]
}

The following cell may take a minute or two to compute.

In [2]:
from splink_data_generation.generate_data_exact import generate_df_gammas_exact
from splink_data_generation.match_prob import add_match_prob
from splink_data_generation.log_likelihood import add_log_likelihood
df = generate_df_gammas_exact(settings)
df = add_match_prob(df, settings)
df = add_log_likelihood(df, settings)
# Drop the right-hand (`_r`) copies of each column to keep the display narrow
cols = [c for c in df.columns if not c.endswith("_r")]
len(df)



40000

In [3]:
df[cols].head()

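| | gamma_col_1 | gamma_col_2 | gamma_col_3 | gamma_col_4 | true_match_l | unique_id_l | true_match_probability_l | true_log_likelihood_l |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 1 | 802e7a59 | 0.000237 | -2.250704 |
| 1 | 0 | 1 | 0 | 0 | 1 | 8c5454e0 | 0.149606 | -5.752448 |
| 2 | 0 | 1 | 0 | 0 | 1 | c958e689 | 0.149606 | -5.752448 |
| 3 | 0 | 1 | 0 | 0 | 1 | d2f9233c | 0.149606 | -5.752448 |
| 4 | 0 | 1 | 0 | 0 | 1 | 5c03633b | 0.149606 | -5.752448 |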
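
The `true_match_probability` and `true_log_likelihood` values in this output can be reproduced by hand from the `m` and `u` probabilities in the settings. The sketch below is not the `splink_data_generation` implementation; it assumes the gamma values are conditionally independent given match status and that the proportion of matches λ is 0.5. Both assumptions are mine, chosen because they reproduce the numbers shown. The helper names `match_probability` and `log_likelihood` are hypothetical.

```python
from math import log, prod

# Generating probabilities copied from the settings cell: [P(gamma=0), P(gamma=1)]
M = {"col_1": [0.1, 0.9], "col_2": [0.05, 0.95], "col_3": [0.2, 0.8], "col_4": [0.05, 0.95]}
U = {"col_1": [0.3, 0.7], "col_2": [0.975, 0.025], "col_3": [0.8, 0.2], "col_4": [0.9, 0.1]}
LAMBDA = 0.5  # assumed proportion of matches (not stated in the settings)


def pattern_probs(gammas):
    """Return P(pattern | match) and P(pattern | non-match) for a gamma pattern."""
    m = prod(M[col][g] for col, g in gammas.items())
    u = prod(U[col][g] for col, g in gammas.items())
    return m, u


def match_probability(gammas, lam=LAMBDA):
    """Posterior probability of a match via Bayes' rule."""
    m, u = pattern_probs(gammas)
    return lam * m / (lam * m + (1 - lam) * u)


def log_likelihood(gammas, lam=LAMBDA):
    """Log of the marginal probability of observing this gamma pattern."""
    m, u = pattern_probs(gammas)
    return log(lam * m + (1 - lam) * u)


# The gamma pattern of rows 1 to 4 in the output above
row = {"col_1": 0, "col_2": 1, "col_3": 0, "col_4": 0}
print(round(match_probability(row), 6), round(log_likelihood(row), 6))
```

Under these assumptions the printed values should line up with the `true_match_probability_l` and `true_log_likelihood_l` columns for that gamma pattern.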


## Does Splink correctly estimate `m` and `u` values for the whole dataset?


In [4]:
import logging 
logging.basicConfig()  # Means logs will print in Jupyter Lab

# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood
logging.getLogger("splink").setLevel(logging.INFO)

from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

In [5]:
# Now use Splink to estimate the params from the data
from splink_data_generation.estimate_splink import estimate

settings = {
    "link_type": "dedupe_only",
    "comparison_columns": [
        {"col_name": "col_1"},
        {"col_name": "col_2"},
        {"col_name": "col_3"},
        {"col_name": "col_4"},
    ],
    "additional_columns_to_retain": [
        "true_match", "true_match_probability"
    ]
}

df_e, linker = estimate(df, deepcopy(settings), spark)
df_e.toPandas().head(5)

INFO:splink.expectation_step:Log likelihood for iteration 0:  -106324.89547299672
INFO:splink.iterate:Iteration 0 complete
INFO:splink.params:The maximum change in parameters was 0.5681762337684632 for key π_gamma_col_1_prob_dist_non_match_level_0_probability
INFO:splink.expectation_step:Log likelihood for iteration 1:  -81393.5722664337
INFO:splink.iterate:Iteration 1 complete
INFO:splink.params:The maximum change in parameters was 0.030521035194396973 for key π_gamma_col_2_prob_dist_match_level_1_probability
INFO:splink.expectation_step:Log likelihood for iteration 2:  -80818.52456855273
INFO:splink.iterate:Iteration 2 complete
INFO:splink.params:The maximum change in parameters was 0.010583415627479553 for key π_gamma_col_2_prob_dist_match_level_0_probability
INFO:splink.expectation_step:Log likelihood for iteration 3:  -80771.77780874836
INFO:splink.iterate:Iteration 3 complete
INFO:splink.params:The maximum change in parameters was 0.004878949373960495 for key π_gamma_col_2_prob_d

| | match_probability | unique_id_l | unique_id_r | gamma_col_1 | prob_gamma_col_1_non_match | prob_gamma_col_1_match | gamma_col_2 | prob_gamma_col_2_non_match | prob_gamma_col_2_match | gamma_col_3 | prob_gamma_col_3_non_match | prob_gamma_col_3_match | gamma_col_4 | prob_gamma_col_4_non_match | prob_gamma_col_4_match | true_match_l | true_match_r | true_match_probability_l | true_match_probability_r |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.000239 | 802e7a59 | 277be987 | 0 | 0.30004 | 0.100021 | 0 | 0.975078 | 0.050206 | 0 | 0.800106 | 0.200078 | 0 | 0.900187 | 0.050073 | 1 | 1 | 0.000237 | 0.000237 |
| 1 | 0.150257 | 8c5454e0 | 84743376 | 0 | 0.30004 | 0.100021 | 1 | 0.024922 | 0.949794 | 0 | 0.800106 | 0.200078 | 0 | 0.900187 | 0.050073 | 1 | 1 | 0.149606 | 0.149606 |
| 2 | 0.150257 | c958e689 | 4de84e42 | 0 | 0.30004 | 0.100021 | 1 | 0.024922 | 0.949794 | 0 | 0.800106 | 0.200078 | 0 | 0.900187 | 0.050073 | 1 | 1 | 0.149606 | 0.149606 |
| 3 | 0.150257 | d2f9233c | 154cb47d | 0 | 0.30004 | 0.100021 | 1 | 0.024922 | 0.949794 | 0 | 0.800106 | 0.200078 | 0 | 0.900187 | 0.050073 | 1 | 1 | 0.149606 | 0.149606 |
| 4 | 0.150257 | 5c03633b | 2272039d | 0 | 0.30004 | 0.100021 | 1 | 0.024922 | 0.949794 | 0 | 0.800106 | 0.200078 | 0 | 0.900187 | 0.050073 | 1 | 1 | 0.149606 | 0.149606 |


In [6]:
linker.params

λ (proportion of matches) = 0.5001532435417175
------------------------------------
gamma_col_1: Comparison of col_1

Probability distribution of gamma values amongst matches:
    value 0: 0.100021 (level represents lowest category of string similarity)
    value 1: 0.899979 (level represents highest category of string similarity)

Probability distribution of gamma values amongst non-matches:
    value 0: 0.300040 (level represents lowest category of string similarity)
    value 1: 0.699960 (level represents highest category of string similarity)
------------------------------------
gamma_col_2: Comparison of col_2

Probability distribution of gamma values amongst matches:
    value 0: 0.050206 (level represents lowest category of string similarity)
    value 1: 0.949794 (level represents highest category of string similarity)

Probability distribution of gamma values amongst non-matches:
    value 0: 0.975078 (level represents lowest category of string similarity)
    value 1: 0.02492

## Estimating parameters on blocked data

Splink has correctly estimated the parameters for the full dataset.

What happens if we block on column 1?

Note that this is equivalent to filtering the dataset down to only the rows where `gamma_col_1 = 1`.
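
Before running the estimate, it may help to see what the generating model itself predicts for this filter. The following sketch enumerates every gamma pattern, weights it by its joint probability, and then conditions on `gamma_col_1 = 1`. It assumes conditional independence of the gamma values given match status and a true λ of 0.5; neither is stated explicitly in the notebook, and the helper names `joint` and `conditional` are hypothetical.

```python
from itertools import product

# Generating probabilities copied from the first settings cell: [P(gamma=0), P(gamma=1)]
M = {"col_1": [0.1, 0.9], "col_2": [0.05, 0.95], "col_3": [0.2, 0.8], "col_4": [0.05, 0.95]}
U = {"col_1": [0.3, 0.7], "col_2": [0.975, 0.025], "col_3": [0.8, 0.2], "col_4": [0.9, 0.1]}
LAMBDA = 0.5  # assumed proportion of matches in the full dataset


def joint(is_match, gammas):
    """P(match status and gamma pattern), assuming conditional independence."""
    probs = M if is_match else U
    weight = LAMBDA if is_match else 1 - LAMBDA
    for col, g in gammas.items():
        weight *= probs[col][g]
    return weight


def conditional(col, is_match, rows):
    """P(gamma_col = 1 | match status), restricted to the given patterns."""
    total = sum(joint(is_match, p) for p in rows)
    agree = sum(joint(is_match, p) for p in rows if p[col] == 1)
    return agree / total


cols = list(M)
patterns = [dict(zip(cols, gs)) for gs in product([0, 1], repeat=len(cols))]
blocked = [p for p in patterns if p["col_1"] == 1]  # the blocking filter

# m and u for col_2 are the same before and after the filter
m2_full, m2_blocked = conditional("col_2", True, patterns), conditional("col_2", True, blocked)
u2_full, u2_blocked = conditional("col_2", False, patterns), conditional("col_2", False, blocked)

# The proportion of matches, however, shifts
matches = sum(joint(True, p) for p in blocked)
lam_blocked = matches / (matches + sum(joint(False, p) for p in blocked))

print(m2_full, m2_blocked, u2_full, u2_blocked, lam_blocked)
```

Under these assumptions, the filter leaves the conditional distributions of the other columns untouched and only shifts λ, to 0.5 × 0.9 / (0.5 × 0.9 + 0.5 × 0.7) = 0.5625. That is close to the λ that Splink reports for the blocked data further down.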

In [7]:
df_col_1 = df[df["gamma_col_1"] == 1]

In [8]:
settings = {
    "link_type": "dedupe_only",
    "comparison_columns": [
        {"col_name": "col_2"},
        {"col_name": "col_3"},
        {"col_name": "col_4"},
    ],
    "max_iterations": 200,
    "em_convergence": 0.0001,
    "additional_columns_to_retain": [
        "true_match", "true_match_probability"
    ]
}

df_e, linker = estimate(df_col_1, deepcopy(settings), spark)
df_e.toPandas().head(5)


INFO:splink.expectation_step:Log likelihood for iteration 0:  -55502.610051441974
INFO:splink.iterate:Iteration 0 complete
INFO:splink.params:The maximum change in parameters was 0.08619900941848757 for key π_gamma_col_3_prob_dist_non_match_level_0_probability
INFO:splink.expectation_step:Log likelihood for iteration 1:  -49623.716720507335
INFO:splink.iterate:Iteration 1 complete
INFO:splink.params:The maximum change in parameters was 0.031153202056884766 for key π_gamma_col_2_prob_dist_non_match_level_0_probability
INFO:splink.expectation_step:Log likelihood for iteration 2:  -49299.89222942199
INFO:splink.iterate:Iteration 2 complete
INFO:splink.params:The maximum change in parameters was 0.011668406426906586 for key π_gamma_col_2_prob_dist_non_match_level_1_probability
INFO:splink.expectation_step:Log likelihood for iteration 3:  -49254.70688829544
INFO:splink.iterate:Iteration 3 complete
INFO:splink.params:The maximum change in parameters was 0.005676984786987305 for key π_gamma_c

| | match_probability | unique_id_l | unique_id_r | gamma_col_2 | prob_gamma_col_2_non_match | prob_gamma_col_2_match | gamma_col_3 | prob_gamma_col_3_non_match | prob_gamma_col_3_match | gamma_col_4 | prob_gamma_col_4_non_match | prob_gamma_col_4_match | true_match_l | true_match_r | true_match_probability_l | true_match_probability_r |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.000909 | eee72f8f | 5d5d784e | 0 | 0.974769 | 0.049861 | 0 | 0.799886 | 0.199882 | 0 | 0.899833 | 0.049837 | 1 | 1 | 0.000915 | 0.000915 |
| 1 | 0.401034 | ef1c82f5 | dcc0cfb1 | 1 | 0.025231 | 0.950139 | 0 | 0.799886 | 0.199882 | 0 | 0.899833 | 0.049837 | 1 | 1 | 0.404255 | 0.404255 |
| 2 | 0.401034 | 681a2b34 | 266bc67e | 1 | 0.025231 | 0.950139 | 0 | 0.799886 | 0.199882 | 0 | 0.899833 | 0.049837 | 1 | 1 | 0.404255 | 0.404255 |
| 3 | 0.401034 | d44c77fc | 0bb1223f | 1 | 0.025231 | 0.950139 | 0 | 0.799886 | 0.199882 | 0 | 0.899833 | 0.049837 | 1 | 1 | 0.404255 | 0.404255 |
| 4 | 0.401034 | 1546b53f | 65d69f7a | 1 | 0.025231 | 0.950139 | 0 | 0.799886 | 0.199882 | 0 | 0.899833 | 0.049837 | 1 | 1 | 0.404255 | 0.404255 |


In [9]:
linker.params

λ (proportion of matches) = 0.5623059868812561
------------------------------------
gamma_col_2: Comparison of col_2

Probability distribution of gamma values amongst matches:
    value 0: 0.049861 (level represents lowest category of string similarity)
    value 1: 0.950139 (level represents highest category of string similarity)

Probability distribution of gamma values amongst non-matches:
    value 0: 0.974769 (level represents lowest category of string similarity)
    value 1: 0.025231 (level represents highest category of string similarity)
------------------------------------
gamma_col_3: Comparison of col_3

Probability distribution of gamma values amongst matches:
    value 0: 0.199882 (level represents lowest category of string similarity)
    value 1: 0.800118 (level represents highest category of string similarity)

Probability distribution of gamma values amongst non-matches:
    value 0: 0.799886 (level represents lowest category of string similarity)
    value 1: 0.20011

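We see that both the `m` and `u` parameter estimates for the remaining columns on the blocked data agree with the generating values to within 0.001. Only λ changes, from about 0.50 on the full dataset to about 0.56 on the blocked data, because a larger share of the blocked comparisons are matches.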
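
To put a number on how close the blocked estimates are, the sketch below compares the level-0 probabilities printed in the outputs above with the values used to generate the data. The estimated figures are copied by hand from this notebook's output rather than read from the `linker` object, whose attribute layout is not assumed here.

```python
# Level-0 probabilities used to generate the data (from the first settings cell)
TRUE = {
    "col_2": {"m": 0.05, "u": 0.975},
    "col_3": {"m": 0.2, "u": 0.8},
    "col_4": {"m": 0.05, "u": 0.9},
}

# Level-0 estimates on the blocked data, copied from the output above
ESTIMATED = {
    "col_2": {"m": 0.049861, "u": 0.974769},
    "col_3": {"m": 0.199882, "u": 0.799886},
    "col_4": {"m": 0.049837, "u": 0.899833},
}

# Absolute error for every (column, parameter) pair
errors = {
    (col, param): abs(ESTIMATED[col][param] - TRUE[col][param])
    for col in TRUE
    for param in ("m", "u")
}

worst = max(errors, key=errors.get)
max_error = errors[worst]
print(f"largest absolute error: {max_error:.6f} for {worst}")
```

Every estimate is within a few ten-thousandths of its generating value.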