# Exploring convergence problems

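In this notebook I explore cases where the EM algorithm, as used by Splink, does not converge to the parameters of the data generating process.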

There are a few conclusions - some of these may be specific to this dataset:
- The EM algorithm does not always converge to the 'correct answer'.
- Where the EM algorithm converges to parameters that differ from those of the data generating mechanism, the log likelihood of the input data given these 'incorrect' parameters is _the same_ as the log likelihood of the data given the correct parameters.
- If you generate data according to the 'incorrect' parameters, it is statistically indistinguishable from data generated according to the correct parameters.


Consider the following data generating process:

In [10]:
settings = {
    "link_type": "dedupe_only",
    "proportion_of_matches" :0.2,
    "comparison_columns": [
        {
            "col_name": "col_1",
            "m_probabilities": [0.3, 0.7],  # Probability of typo
            "u_probabilities": [0.9, 0.1],  # Probability of collision
        },
        {
            "col_name": "col_2",
            "m_probabilities": [0.1, 0.9],  # Probability of typo
            "u_probabilities": [0.975, 0.025],  # Probability of collision
        }
    ],
    "max_iterations": 200,
    "em_convergence": 0.0001,
    "additional_columns_to_retain": [
        "true_match", "true_match_probability", "true_log_likelihood"
    ]
}

Let's generate some data according to these parameters.

In [11]:
from splink_data_generation.generate_data_exact import generate_df_gammas_exact
from splink_data_generation.match_prob import add_match_prob
from splink_data_generation.log_likelihood import add_log_likelihood
df = generate_df_gammas_exact(settings)
df = add_match_prob(df, settings)
df = add_log_likelihood(df, settings)
cols = [c for c in df.columns if "_r" not in c]
df[cols].head()



Unnamed: 0,gamma_col_1,gamma_col_2,true_match_l,unique_id_l,true_match_probability_l,true_log_likelihood_l
0,0,0,1,b70605f4,0.008475,-0.345311
1,0,1,1,8bb19921,0.75,-2.631089
2,0,1,1,17dde00f,0.75,-2.631089
3,0,1,1,48ae98dc,0.75,-2.631089
4,0,1,1,bd84112c,0.75,-2.631089
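
The `true_match_probability` and `true_log_likelihood` columns can be reproduced by hand. The sketch below assumes the usual Fellegi-Sunter setup with conditionally independent columns: the match probability of a row is obtained by Bayes' rule, and the row's log likelihood is the natural log of the probability of its gamma pattern under the mixture of matches and non-matches. The helper `row_stats` is a name made up for this illustration and is not part of `splink_data_generation`.

```python
import math

# Parameters copied from the `settings` dictionary above
lam = 0.2  # proportion_of_matches
m = {"col_1": [0.3, 0.7], "col_2": [0.1, 0.9]}
u = {"col_1": [0.9, 0.1], "col_2": [0.975, 0.025]}


def row_stats(gammas):
    """Hypothetical helper: gammas maps column name -> observed level (0 or 1).

    Returns (match probability, log likelihood) for a single row.
    """
    p_match = lam          # P(pattern and match)
    p_non_match = 1 - lam  # P(pattern and non-match)
    for col, level in gammas.items():
        p_match *= m[col][level]
        p_non_match *= u[col][level]
    p_row = p_match + p_non_match  # P(pattern) under the mixture
    return p_match / p_row, math.log(p_row)


for gammas in ({"col_1": 0, "col_2": 0}, {"col_1": 0, "col_2": 1}):
    prob, log_lik = row_stats(gammas)
    print(gammas, round(prob, 6), round(log_lik, 6))
```

For the two gamma patterns shown in the table, this gives the same values as the generated columns to six decimal places.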


First we will use Splink to estimate the parameters, setting the true parameters as the starting values.

In [12]:
import logging 
logging.basicConfig()  # Means logs will print in Jupyter Lab

# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood
logging.getLogger("splink").setLevel(logging.INFO)

from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)


In [14]:
# Now use Splink to estimate the params from the data

from splink_data_generation.estimate_splink import estimate

df_e, linker = estimate(df, settings, spark)
df_e.toPandas().head(5)

INFO:splink.expectation_step:Log likelihood for iteration 0:  -458.2802386038024
INFO:splink.iterate:Iteration 0 complete
INFO:splink.params:The maximum change in parameters was 2.384185793236071e-08 for key π_gamma_col_1_prob_dist_non_match_level_0_probability
INFO:splink.iterate:EM algorithm has converged
INFO:splink.expectation_step:Log likelihood for iteration 1:  -458.2802400939189


Unnamed: 0,match_probability,unique_id_l,unique_id_r,gamma_col_1,prob_gamma_col_1_non_match,prob_gamma_col_1_match,gamma_col_2,prob_gamma_col_2_non_match,prob_gamma_col_2_match,true_match_l,true_match_r,true_match_probability_l,true_match_probability_r,true_log_likelihood_l,true_log_likelihood_r
0,0.008475,b70605f4,bcbc0e28,0,0.9,0.3,0,0.975,0.1,1,1,0.008475,0.008475,-0.345311,-0.345311
1,0.75,8bb19921,33d3d035,0,0.9,0.3,1,0.025,0.9,1,1,0.75,0.75,-2.631089,-2.631089
2,0.75,17dde00f,8945a9ce,0,0.9,0.3,1,0.025,0.9,1,1,0.75,0.75,-2.631089,-2.631089
3,0.75,48ae98dc,126bada9,0,0.9,0.3,1,0.025,0.9,1,1,0.75,0.75,-2.631089,-2.631089
4,0.75,bd84112c,f7dd4642,0,0.9,0.3,1,0.025,0.9,1,1,0.75,0.75,-2.631089,-2.631089


As expected, Splink converges immediately because it is already at an optimum: the first iteration changes no parameter by more than the `em_convergence` threshold.
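
To see why the true parameters are a fixed point, here is a minimal pure-Python sketch of a single EM iteration for this two-column model. It is not Splink's implementation: it works on the share of rows showing each gamma pattern rather than on individual rows, and it assumes the exact generator reproduces the pattern shares implied by the parameters. The function names (`split_probs`, `em_step`, `max_change`) are invented for this illustration.

```python
from itertools import product

PATTERNS = list(product([0, 1], repeat=2))  # (gamma_col_1, gamma_col_2)

# True parameters, copied from the `settings` dictionary
TRUE = {"lam": 0.2, "m": [[0.3, 0.7], [0.1, 0.9]], "u": [[0.9, 0.1], [0.975, 0.025]]}


def split_probs(p):
    """For each pattern, return (P(pattern, match), P(pattern, non-match))."""
    out = {}
    for g in PATTERNS:
        match, non_match = p["lam"], 1 - p["lam"]
        for col, level in enumerate(g):
            match *= p["m"][col][level]
            non_match *= p["u"][col][level]
        out[g] = (match, non_match)
    return out


def em_step(p, weights):
    """One EM iteration; `weights` is the share of rows showing each pattern."""
    # E-step: posterior match probability for each pattern
    resp = {g: a / (a + b) for g, (a, b) in split_probs(p).items()}
    w_match = {g: weights[g] * resp[g] for g in PATTERNS}
    w_non = {g: weights[g] * (1 - resp[g]) for g in PATTERNS}
    tot_match, tot_non = sum(w_match.values()), sum(w_non.values())
    # M-step: re-estimate lambda and the per-column m and u distributions
    new = {"lam": tot_match / (tot_match + tot_non), "m": [], "u": []}
    for col in range(2):
        new["m"].append(
            [sum(w for g, w in w_match.items() if g[col] == lv) / tot_match for lv in (0, 1)]
        )
        new["u"].append(
            [sum(w for g, w in w_non.items() if g[col] == lv) / tot_non for lv in (0, 1)]
        )
    return new


def flatten(p):
    """List every parameter in a fixed order so two sets can be compared."""
    return [p["lam"]] + [x for key in ("m", "u") for col in p[key] for x in col]


def max_change(old, new):
    """Largest absolute change in any single parameter."""
    return max(abs(a - b) for a, b in zip(flatten(old), flatten(new)))


# Pattern shares implied by the true parameters
weights = {g: a + b for g, (a, b) in split_probs(TRUE).items()}

change = max_change(TRUE, em_step(TRUE, weights))
print(f"max parameter change after one step: {change:.2e}")
print("converged:", change < 0.0001)  # em_convergence from the settings
```

Starting from the true parameters, the E-step and M-step hand back the same values up to floating point error, so a stopping rule based on the maximum parameter change is satisfied after the first iteration.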

## What happens if we use Splink's default starting values?

In [29]:
settings_2 = {
    "link_type": "dedupe_only",
    "comparison_columns": [
        {
            "col_name": "col_1",
        },
        {
            "col_name": "col_2",
        }
    ],
    "max_iterations": 200,
    "em_convergence": 0.0001,
    "additional_columns_to_retain": [
        "true_match", "true_match_probability", "true_log_likelihood"
    ]
}

In [30]:
from copy import deepcopy

# Pass a copy so that estimation cannot modify settings_2 in place
df_e, linker = estimate(df, deepcopy(settings_2), spark)

INFO:splink.expectation_step:Log likelihood for iteration 0:  -485.16447604748
INFO:splink.iterate:Iteration 0 complete
INFO:splink.params:The maximum change in parameters was 0.07664320766925811 for key π_gamma_col_2_prob_dist_match_level_0_probability
INFO:splink.expectation_step:Log likelihood for iteration 1:  -458.302822026541
INFO:splink.iterate:Iteration 1 complete
INFO:splink.params:The maximum change in parameters was 0.003287985920906067 for key π_gamma_col_2_prob_dist_match_level_0_probability
INFO:splink.expectation_step:Log likelihood for iteration 2:  -458.2815435045475
INFO:splink.iterate:Iteration 2 complete
INFO:splink.params:The maximum change in parameters was 0.0007975101470947266 for key π_gamma_col_2_prob_dist_match_level_1_probability
INFO:splink.expectation_step:Log likelihood for iteration 3:  -458.28031808776996
INFO:splink.iterate:Iteration 3 complete
INFO:splink.params:The maximum change in parameters was 0.00019615888595581055 for key π_gamma_col_2_prob_dis

In [31]:
linker.model

λ (proportion of matches) = 0.17653286457061768
------------------------------------
gamma_col_1: Comparison of col_1

Probability distribution of gamma values amongst matches:
    value 0: 0.146975 (level represents lowest category of string similarity)
    value 1: 0.853025 (level represents highest category of string similarity)

Probability distribution of gamma values amongst non-matches:
    value 0: 0.915706 (level represents lowest category of string similarity)
    value 1: 0.084294 (level represents highest category of string similarity)
------------------------------------
gamma_col_2: Comparison of col_2

Probability distribution of gamma values amongst matches:
    value 0: 0.180973 (level represents lowest category of string similarity)
    value 1: 0.819027 (level represents highest category of string similarity)

Probability distribution of gamma values amongst non-matches:
    value 0: 0.932705 (level represents lowest category of string similarity)
    value 1: 0.0672

## Generating data according to the 'alternative' data generating mechanism

We see that Splink reaches the _same_ log likelihood as before (about -458.28) with _different_ parameters.

What happens if we generate a dataset using the alternative parameters?  

In [35]:
from splink.params import get_or_update_settings
settings_alternative = get_or_update_settings(linker.model, settings_2)
settings_alternative

{'link_type': 'dedupe_only',
 'comparison_columns': [{'col_name': 'col_1',
   'm_probabilities': [0.14697474241256714, 0.8530252575874329],
   'u_probabilities': [0.9157063961029053, 0.08429359644651413]},
  {'col_name': 'col_2',
   'm_probabilities': [0.18097323179244995, 0.81902676820755],
   'u_probabilities': [0.9327054619789124, 0.06729456037282944]}],
 'max_iterations': 200,
 'em_convergence': 0.0001,
 'additional_columns_to_retain': ['true_match',
  'true_match_probability',
  'true_log_likelihood'],
 'proportion_of_matches': 0.17653286457061768}

Note that we have to use `generate_df_gammas_random` because `generate_df_gammas_exact` requires the `m` and `u` probabilities to have a 'reasonably small' lowest common multiple (see [here](https://github.com/moj-analytical-services/splink_data_generation/blob/a2ab256f6cd25899c4c84cfa1b58bca615249a15/splink_data_generation/generate_data_exact.py#L56)).

In [49]:
from splink_data_generation.generate_data_random import generate_df_gammas_random
# Increase number of rows for higher accuracy!
df_alt = generate_df_gammas_random(1000000, settings_alternative)  
df_alt.head()

Unnamed: 0,gamma_col_1,gamma_col_2,true_match_l,true_match_r,unique_id_l,unique_id_r
0,0,0,1,1,72f6b960,fc09318a
1,1,1,1,1,b82f182e,67193a46
2,1,1,1,1,8d3abbc4,a72471df
3,1,1,1,1,2ab13e2f,7acae1a7
4,1,1,1,1,b085f5e1,f91f18d2


In a real-world linking situation, the only information we have is the comparison vectors (the gamma columns).

Is there any difference between the gamma columns in this alternative dataset `df_alt` and those in the original dataset `df`?

In [50]:
display(df[["gamma_col_1", "gamma_col_2"]].corr())
display(df_alt[["gamma_col_1", "gamma_col_2"]].corr())

Unnamed: 0,gamma_col_1,gamma_col_2
gamma_col_1,1.0,0.506945
gamma_col_2,0.506945,1.0


Unnamed: 0,gamma_col_1,gamma_col_2
gamma_col_1,1.0,0.507404
gamma_col_2,0.507404,1.0


In [51]:
display(df[["gamma_col_1", "gamma_col_2"]].mean())
display(df_alt[["gamma_col_1", "gamma_col_2"]].mean())

gamma_col_1    0.22
gamma_col_2    0.20
dtype: float64

gamma_col_1    0.220664
gamma_col_2    0.200167
dtype: float64

This suggests that the gamma columns of the two dataframes follow the same distribution. The small differences are what we would expect from statistical variation, because the rows of `df_alt` were drawn at random.

One reason I find this result surprising is that the distribution of the `gamma` columns is the same, yet the overall proportion of matches differs between the two sets of parameters (0.2 versus roughly 0.177).