# What is the effect of multiple blocking rules on `m` and `u` values?

Splink allows the user to specify multiple blocking rules.  When more than one blocking rule is specified, Splink will generate a deduplicated dataset of all record comparisons that obey at least one of the blocking rules.  The Fellegi-Sunter/Expectation Maximisation approach is then used on this combined dataset.
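To make "obey at least one of the blocking rules" concrete, here is a minimal pure-Python sketch of the idea. It is not Splink's implementation (Splink generates comparisons in SQL); the records, field names and rules are hypothetical and exist only for illustration.

```python
from itertools import combinations

# Hypothetical toy records; the field names are illustrative only.
records = [
    {"id": 1, "first_name": "ann", "surname": "lee"},
    {"id": 2, "first_name": "ann", "surname": "li"},
    {"id": 3, "first_name": "bob", "surname": "lee"},
    {"id": 4, "first_name": "cat", "surname": "ray"},
    {"id": 5, "first_name": "ann", "surname": "lee"},
]

# Two hypothetical blocking rules: agree on first name, or agree on surname.
blocking_rules = [
    lambda left, right: left["first_name"] == right["first_name"],
    lambda left, right: left["surname"] == right["surname"],
]


def blocked_pairs(records, rules):
    """Return the deduplicated set of id pairs that satisfy at least one rule."""
    pairs = set()
    for left, right in combinations(records, 2):
        if any(rule(left, right) for rule in rules):
            pairs.add((left["id"], right["id"]))
    return pairs


pairs = blocked_pairs(records, blocking_rules)
print(sorted(pairs))
```

The pair `(1, 5)` satisfies both rules but appears only once in the result, which is the deduplication described above.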

In this notebook, I demonstrate why this can be problematic.

I generate a synthetic dataset with known `m` and `u` values for four comparison columns: `col_1`, `col_2`, `col_3` and `col_4`.

I then block on `col_1` or `col_2`, and estimate the parameters of the model.

#### Key findings:

- Blocking on `col_1` or `col_2` and then estimating parameters for all four columns, including the blocking columns, results in incorrect estimates for _all_ columns.
- Blocking on `col_1` or `col_2` and then estimating parameters for only `col_3` and `col_4` results in correct parameter estimates for `col_3` and `col_4`.


In [73]:
from copy import deepcopy
settings = {
    "link_type": "dedupe_only",
    "comparison_columns": [
        {
            "col_name": "col_1",
            "m_probabilities": [0.1, 0.9],  # Probability of typo
            "u_probabilities": [0.95, 0.05],  # Probability of collision
        },
        {
            "col_name": "col_2",
            "m_probabilities": [0.05, 0.95],  # Probability of typo
            "u_probabilities": [0.975, 0.025],  # Probability of collision
        },
        {
            "col_name": "col_3",
            "m_probabilities": [0.2, 0.8],  # Probability of typo
            "u_probabilities": [0.8, 0.2],  # Probability of collision
        },
        {
            "col_name": "col_4",
            "m_probabilities": [0.05, 0.95],  # Probability of typo
            "u_probabilities": [0.9, 0.1],  # Probability of collision
        },
    ],
    "additional_columns_to_retain": [
        "true_match"
    ]
}

In [74]:
from splink_data_generation.generate_data_exact import generate_df_gammas_exact
df = generate_df_gammas_exact(settings)
cols = [c for c in df.columns if "_r" not in c]
len(df)



60000

In [75]:
df.head()

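|   | gamma_col_1 | gamma_col_2 | gamma_col_3 | gamma_col_4 | true_match_l | true_match_r | unique_id_l | unique_id_r |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 17c210b7 | 23087bd2 |
| 1 | 0 | 1 | 0 | 0 | 1 | 1 | cc8e1bf9 | 02a4c2a1 |
| 2 | 0 | 1 | 0 | 0 | 1 | 1 | 9ae59505 | 06450471 |
| 3 | 0 | 1 | 0 | 0 | 1 | 1 | 0915e334 | 2f226b72 |
| 4 | 0 | 1 | 0 | 0 | 1 | 1 | 0463f328 | 4d161f05 |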


We can see that in the whole dataset, the assumption of conditional independence of comparison vector values given match status holds:

In [76]:
import pandas as pd
pd.options.display.float_format = '{:,.4f}'.format
gamma_cols = [c for c in df.columns if 'gamma_' in c]
f1 = df["true_match_l"] == 1
df.loc[f1, gamma_cols].corr()

|   | gamma_col_1 | gamma_col_2 | gamma_col_3 | gamma_col_4 |
|---|---|---|---|---|
| gamma_col_1 | 1.0 | 0.0 | 0.0 | -0.0 |
| gamma_col_2 | 0.0 | 1.0 | 0.0 | -0.0 |
| gamma_col_3 | 0.0 | 0.0 | 1.0 | -0.0 |
| gamma_col_4 | -0.0 | -0.0 | -0.0 | 1.0 |


In [77]:
f2 = df["true_match_l"] == 0
df.loc[f2, gamma_cols].corr()

|   | gamma_col_1 | gamma_col_2 | gamma_col_3 | gamma_col_4 |
|---|---|---|---|---|
| gamma_col_1 | 1.0 | -0.0 | 0.0 | -0.0 |
| gamma_col_2 | -0.0 | 1.0 | 0.0 | 0.0 |
| gamma_col_3 | 0.0 | 0.0 | 1.0 | 0.0 |
| gamma_col_4 | -0.0 | 0.0 | 0.0 | 1.0 |


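This assumption also holds if we block on a single variable (`col_1`). Correlations involving `gamma_col_1` are undefined because that column is constant within the block: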

In [78]:
block_1 = df["gamma_col_1"] == 1
df.loc[f1&block_1, gamma_cols].corr()

|   | gamma_col_1 | gamma_col_2 | gamma_col_3 | gamma_col_4 |
|---|---|---|---|---|
| gamma_col_1 | NaN | NaN | NaN | NaN |
| gamma_col_2 | NaN | 1.0 | 0.0 | -0.0 |
| gamma_col_3 | NaN | 0.0 | 1.0 | -0.0 |
| gamma_col_4 | NaN | -0.0 | -0.0 | 1.0 |
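A short derivation shows why a single blocking rule leaves the remaining columns conditionally independent. This is a sketch in my own notation (not Splink's): $\gamma_k$ is the comparison value for `col_k` and $M$ is the match status. Because the $\gamma_k$ are independent given $M$, conditioning on $\gamma_1 = 1$ simply cancels the `col_1` factor:

```latex
P(\gamma_2, \gamma_3, \gamma_4 \mid \gamma_1 = 1, M)
  = \frac{P(\gamma_1 = 1 \mid M)\, P(\gamma_2 \mid M)\, P(\gamma_3 \mid M)\, P(\gamma_4 \mid M)}
         {P(\gamma_1 = 1 \mid M)}
  = \prod_{k=2}^{4} P(\gamma_k \mid M)
```

The same cancellation is not available when the conditioning event is $\gamma_1 = 1 \lor \gamma_2 = 1$, because that event does not factorise into a statement about one column at a time.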


However, it no longer holds if we block on two variables, keeping every comparison that agrees on `col_1` or on `col_2`:

In [79]:
block_2 = df["gamma_col_2"] == 1
df_blocked = df[block_1 | block_2]

In [80]:
f1 = df_blocked["true_match_l"] == 1
df_blocked.loc[f1, gamma_cols].corr()


|   | gamma_col_1 | gamma_col_2 | gamma_col_3 | gamma_col_4 |
|---|---|---|---|---|
| gamma_col_1 | 1.0 | -0.0707 | 0.0 | -0.0 |
| gamma_col_2 | -0.0707 | 1.0 | 0.0 | 0.0 |
| gamma_col_3 | 0.0 | 0.0 | 1.0 | -0.0 |
| gamma_col_4 | -0.0 | 0.0 | -0.0 | 1.0 |


In [81]:
# Same check for the non-matches in the blocked data
f2 = df_blocked["true_match_l"] == 0
df_blocked.loc[f2, gamma_cols].corr()


|   | gamma_col_1 | gamma_col_2 | gamma_col_3 | gamma_col_4 |
|---|---|---|---|---|
| gamma_col_1 | 1.0 | -0.9624 | 0.0 | -0.0 |
| gamma_col_2 | -0.9624 | 1.0 | 0.0 | 0.0 |
| gamma_col_3 | 0.0 | 0.0 | 1.0 | -0.0 |
| gamma_col_4 | -0.0 | 0.0 | -0.0 | 1.0 |
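The two non-zero correlations above can be reproduced without any data, using only the `m` and `u` values from `settings`. The sketch below assumes that `gamma_col_1` and `gamma_col_2` are independent Bernoulli indicators given match status, then conditions on at least one of them being 1. The function name is mine, not part of Splink.

```python
from math import sqrt


def corr_given_either_agrees(p1, p2):
    """Correlation of two independent 0/1 indicators with P(1) = p1 and p2,
    conditional on at least one of them being 1 (the OR blocking rule)."""
    kept = 1.0 - (1.0 - p1) * (1.0 - p2)  # P(g1 == 1 or g2 == 1)
    e1 = p1 / kept          # E[g1 | kept]
    e2 = p2 / kept          # E[g2 | kept]
    e12 = p1 * p2 / kept    # E[g1 * g2 | kept]
    cov = e12 - e1 * e2
    return cov / sqrt(e1 * (1.0 - e1) * e2 * (1.0 - e2))


# P(gamma = 1) for col_1 and col_2, taken from the settings above.
corr_matches = corr_given_either_agrees(0.9, 0.95)        # m values
corr_non_matches = corr_given_either_agrees(0.05, 0.025)  # u values
print(round(corr_matches, 4), round(corr_non_matches, 4))
```

Rounded to four decimal places, the two values agree with the correlation tables above. Among non-matches, agreement on either column is rare, so a pair that passes the blocking rule almost always agrees on exactly one of the two columns, which is why the induced correlation is so strongly negative.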


## What do these correlations (the violation of conditional independence) mean for the estimation of parameters?

### Example 1:  Estimate all parameters


In [82]:
import logging 
logging.basicConfig()  # Means logs will print in Jupyter Lab

# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood
logging.getLogger("splink").setLevel(logging.INFO)

from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

In [83]:
from splink_data_generation.estimate_splink import estimate

settings_2 = deepcopy(settings)
settings_2["proportion_of_matches"] = df_blocked["true_match_l"].mean()

df_e, linker = estimate(df_blocked,settings_2,spark)
df_e.toPandas().head(5)

INFO:splink.expectation_step:Log likelihood for iteration 0:  -42423.936352907534
INFO:splink.iterate:Iteration 0 complete
INFO:splink.params:The maximum change in parameters was 0.6787893295288085 for key π_gamma_col_1_prob_dist_non_match_level_0_probability
INFO:splink.expectation_step:Log likelihood for iteration 1:  -37339.63852485094
INFO:splink.iterate:Iteration 1 complete
INFO:splink.params:The maximum change in parameters was 0.03456217050552368 for key π_gamma_col_2_prob_dist_non_match_level_0_probability
INFO:splink.expectation_step:Log likelihood for iteration 2:  -37214.084937776075
INFO:splink.iterate:Iteration 2 complete
INFO:splink.params:The maximum change in parameters was 0.02778526395559311 for key π_gamma_col_3_prob_dist_non_match_level_1_probability
INFO:splink.expectation_step:Log likelihood for iteration 3:  -37162.240116894514
INFO:splink.iterate:Iteration 3 complete
INFO:splink.params:The maximum change in parameters was 0.022252395749092102 for key π_gamma_col

|   | match_probability | unique_id_l | unique_id_r | gamma_col_1 | prob_gamma_col_1_non_match | prob_gamma_col_1_match | gamma_col_2 | prob_gamma_col_2_non_match | prob_gamma_col_2_match | gamma_col_3 | prob_gamma_col_3_non_match | prob_gamma_col_3_match | gamma_col_4 | prob_gamma_col_4_non_match | prob_gamma_col_4_match | true_match_l | true_match_r |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0506 | cc8e1bf9 | 02a4c2a1 | 0 | 0.2705 | 0.1022 | 1 | 0.4083 | 0.9473 | 0 | 0.7558 | 0.2036 | 0 | 0.9748 | 0.0339 | 1 | 1 |
| 1 | 0.0506 | 9ae59505 | 06450471 | 0 | 0.2705 | 0.1022 | 1 | 0.4083 | 0.9473 | 0 | 0.7558 | 0.2036 | 0 | 0.9748 | 0.0339 | 1 | 1 |
| 2 | 0.0506 | 0915e334 | 2f226b72 | 0 | 0.2705 | 0.1022 | 1 | 0.4083 | 0.9473 | 0 | 0.7558 | 0.2036 | 0 | 0.9748 | 0.0339 | 1 | 1 |
| 3 | 0.0506 | 0463f328 | 4d161f05 | 0 | 0.2705 | 0.1022 | 1 | 0.4083 | 0.9473 | 0 | 0.7558 | 0.2036 | 0 | 0.9748 | 0.0339 | 1 | 1 |
| 4 | 0.0506 | 30009c84 | 4952422d | 0 | 0.2705 | 0.1022 | 1 | 0.4083 | 0.9473 | 0 | 0.7558 | 0.2036 | 0 | 0.9748 | 0.0339 | 1 | 1 |


In [84]:
linker.params

λ (proportion of matches) = 0.8663036227226257
------------------------------------
gamma_col_1: Comparison of col_1

Probability distribution of gamma values amongst matches:
    value 0: 0.102230 (level represents lowest category of string similarity)
    value 1: 0.897770 (level represents highest category of string similarity)

Probability distribution of gamma values amongst non-matches:
    value 0: 0.270495 (level represents lowest category of string similarity)
    value 1: 0.729505 (level represents highest category of string similarity)
------------------------------------
gamma_col_2: Comparison of col_2

Probability distribution of gamma values amongst matches:
    value 0: 0.052661 (level represents lowest category of string similarity)
    value 1: 0.947339 (level represents highest category of string similarity)

Probability distribution of gamma values amongst non-matches:
    value 0: 0.591683 (level represents lowest category of string similarity)
    value 1: 0.40831
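To put numbers on how far these estimates drift, the estimated probabilities of `gamma = 0` can be set against the true values from `settings`. This is a small sketch: the estimates are copied by hand from the outputs above (`linker.params` for `col_1` and `col_2`, the first row of `df_e` for `col_3` and `col_4`), and the variable names are mine.

```python
# (column, kind, true P(gamma = 0), estimated P(gamma = 0)).
# True values come from `settings`; estimates are copied from the outputs above.
level_0 = [
    ("col_1", "m", 0.10, 0.102230),
    ("col_1", "u", 0.95, 0.270495),
    ("col_2", "m", 0.05, 0.052661),
    ("col_2", "u", 0.975, 0.591683),
    ("col_3", "m", 0.20, 0.2036),
    ("col_3", "u", 0.80, 0.7558),
    ("col_4", "m", 0.05, 0.0339),
    ("col_4", "u", 0.90, 0.9748),
]

# Rank the parameters by absolute error, worst first.
errors = sorted(
    ((abs(est - true), col, kind) for col, kind, true, est in level_0),
    reverse=True,
)
for err, col, kind in errors:
    print(f"{col} {kind}: absolute error {err:.4f}")
```

The largest errors are in the `u` estimates for the two blocking columns, with smaller but non-zero errors in every other parameter.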

### Example 2:  Estimate only `col_3` and `col_4` parameters

In [85]:
from copy import deepcopy
settings_3 = {
    "link_type": "dedupe_only",
    "proportion_of_matches":df_blocked["true_match_l"].mean(), 
    "comparison_columns": [
        {
            "col_name": "col_3",
            "m_probabilities": [0.2, 0.8],  # Probability of typo
            "u_probabilities": [0.8, 0.2],  # Probability of collision
        },
        {
            "col_name": "col_4",
            "m_probabilities": [0.05, 0.95],  # Probability of typo
            "u_probabilities": [0.9, 0.1],  # Probability of collision
        },
    ],
    "additional_columns_to_retain": [
        "true_match"
    ]
}

In [86]:
df_e, linker = estimate(df_blocked,deepcopy(settings_3),spark)
df_e.toPandas().head(5)

INFO:splink.expectation_step:Log likelihood for iteration 0:  -22276.91616396846
INFO:splink.iterate:Iteration 0 complete
INFO:splink.params:The maximum change in parameters was 2.384185793236071e-08 for key π_gamma_col_4_prob_dist_non_match_level_0_probability
INFO:splink.iterate:EM algorithm has converged
INFO:splink.expectation_step:Log likelihood for iteration 1:  -22276.916111814535


|   | match_probability | unique_id_l | unique_id_r | gamma_col_3 | prob_gamma_col_3_non_match | prob_gamma_col_3_match | gamma_col_4 | prob_gamma_col_4_non_match | prob_gamma_col_4_match | true_match_l | true_match_r |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0857 | cc8e1bf9 | 02a4c2a1 | 0 | 0.8 | 0.2 | 0 | 0.9 | 0.05 | 1 | 1 |
| 1 | 0.0857 | 9ae59505 | 06450471 | 0 | 0.8 | 0.2 | 0 | 0.9 | 0.05 | 1 | 1 |
| 2 | 0.0857 | 0915e334 | 2f226b72 | 0 | 0.8 | 0.2 | 0 | 0.9 | 0.05 | 1 | 1 |
| 3 | 0.0857 | 0463f328 | 4d161f05 | 0 | 0.8 | 0.2 | 0 | 0.9 | 0.05 | 1 | 1 |
| 4 | 0.0857 | 30009c84 | 4952422d | 0 | 0.8 | 0.2 | 0 | 0.9 | 0.05 | 1 | 1 |


In [87]:
linker.params

λ (proportion of matches) = 0.8708971738815308
------------------------------------
gamma_col_3: Comparison of col_3

Probability distribution of gamma values amongst matches:
    value 0: 0.200000 (level represents lowest category of string similarity)
    value 1: 0.800000 (level represents highest category of string similarity)

Probability distribution of gamma values amongst non-matches:
    value 0: 0.800000 (level represents lowest category of string similarity)
    value 1: 0.200000 (level represents highest category of string similarity)
------------------------------------
gamma_col_4: Comparison of col_4

Probability distribution of gamma values amongst matches:
    value 0: 0.050000 (level represents lowest category of string similarity)
    value 1: 0.950000 (level represents highest category of string similarity)

Probability distribution of gamma values amongst non-matches:
    value 0: 0.900000 (level represents lowest category of string similarity)
    value 1: 0.10000
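The contrast between the two examples can be reproduced without Spark. The sketch below is a plain textbook EM for the Fellegi-Sunter mixture, not Splink's code, and it rests on several assumptions of mine:

- the blocked data is treated as an exact population, using gamma-pattern probabilities rather than sampled rows;
- the share of matches in the blocked data is taken to be the 0.8709 reported above;
- EM is started from the true `m` and `u` values, as in the settings used in this notebook.

All function and variable names are hypothetical.

```python
from itertools import product

ALL_COLS = ("col_1", "col_2", "col_3", "col_4")
# P(gamma = 1) among matches (m) and non-matches (u), from the settings above.
TRUE_M = {"col_1": 0.9, "col_2": 0.95, "col_3": 0.8, "col_4": 0.95}
TRUE_U = {"col_1": 0.05, "col_2": 0.025, "col_3": 0.2, "col_4": 0.1}
LAMBDA_BLOCKED = 0.8709  # assumed share of matches in df_blocked (reported above)


def pattern_prob(pattern, p_agree, cols):
    """Probability of a gamma pattern, assuming conditional independence."""
    prob = 1.0
    for col, gamma in zip(cols, pattern):
        prob *= p_agree[col] if gamma == 1 else 1.0 - p_agree[col]
    return prob


def blocked_distribution():
    """Exact distribution of gamma patterns after blocking on col_1 OR col_2."""
    kept = [g for g in product((0, 1), repeat=4) if g[0] == 1 or g[1] == 1]
    m_kept = sum(pattern_prob(g, TRUE_M, ALL_COLS) for g in kept)
    u_kept = sum(pattern_prob(g, TRUE_U, ALL_COLS) for g in kept)
    return {
        g: LAMBDA_BLOCKED * pattern_prob(g, TRUE_M, ALL_COLS) / m_kept
        + (1.0 - LAMBDA_BLOCKED) * pattern_prob(g, TRUE_U, ALL_COLS) / u_kept
        for g in kept
    }


def run_em(cols, n_iter=25):
    """Run EM on the blocked data, modelling only the columns in `cols`."""
    idx = [ALL_COLS.index(c) for c in cols]
    data = {}  # pattern over `cols` -> probability mass in the blocked data
    for pattern, weight in blocked_distribution().items():
        key = tuple(pattern[i] for i in idx)
        data[key] = data.get(key, 0.0) + weight

    lam = LAMBDA_BLOCKED
    m = {c: TRUE_M[c] for c in cols}
    u = {c: TRUE_U[c] for c in cols}
    for _ in range(n_iter):
        # E-step: posterior match probability for each gamma pattern.
        post = {}
        for g in data:
            pm = lam * pattern_prob(g, m, cols)
            pu = (1.0 - lam) * pattern_prob(g, u, cols)
            post[g] = pm / (pm + pu)
        # M-step: re-estimate lambda, then P(gamma = 1) for each column.
        lam = sum(w * post[g] for g, w in data.items())
        for k, col in enumerate(cols):
            m[col] = sum(w * post[g] for g, w in data.items() if g[k] == 1) / lam
            u[col] = sum(
                w * (1.0 - post[g]) for g, w in data.items() if g[k] == 1
            ) / (1.0 - lam)
    return lam, m, u


_, m_all, u_all = run_em(ALL_COLS)
_, m_34, u_34 = run_em(("col_3", "col_4"))
print("all columns, u for col_1:", round(u_all["col_1"], 4), "(true value 0.05)")
print("col_3/col_4 only, u:", {c: round(p, 4) for c, p in u_34.items()})
```

When only `col_3` and `col_4` are modelled, EM stays at the true values it started from, because the blocked data really is a conditionally independent mixture in those two columns. When all four columns are modelled, the `u` estimate for `col_1` moves far from its true value, in line with Example 1.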