# How do violations of the conditional independence assumption affect parameter estimates?

The `generate_df_gammas_random` function allows the user to specify `cov_m` and `cov_u`, covariance matrices that dictate the correlation between values in the comparison vector (the 'gamma' columns) amongst matches and non-matches respectively.

❗️Warning: The correlations observed in the generated dataset follow the specified correlations in direction, but are usually weaker.

How does this work?

Imagine we are generating values for matches according to the following settings:

```
settings = {
    "comparison_columns": [
        {
            "col_name": "col_1",
            "m_probabilities": [0.3, 0.7],

        },
        {
            "col_name": "col_2",
            "m_probabilities": [0.1, 0.9],
        }
    ],
}
```

For the case of no correlation, we generate two independent random numbers $x1$ and $x2$, each distributed $U(0,1)$.

We then say:
```
if x1 > 0.3 then gamma_col_1 = 1 else gamma_col_1 = 0
if x2 > 0.1 then gamma_col_2 = 1 else gamma_col_2 = 0
```

Where a covariance matrix `cov_m` is provided, we instead draw a pair of numbers $n1$ and $n2$ from a multivariate normal distribution with that covariance matrix.

We then map these two numbers to the interval $[0,1]$ by passing them through the normal distribution's cumulative distribution function, yielding $x1$ and $x2$.

Finally, we apply the same logic as above.
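The steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code inside `generate_df_gammas_random`: the function names are hypothetical, and it assumes a two-column covariance matrix with unit variances, so that the matrix is fully described by a single correlation `rho`.

```python
import math
import random
from statistics import NormalDist


def correlated_uniforms(rho, rng):
    """Draw (x1, x2), each U(0,1), linked through a bivariate normal with correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    # Apply the Cholesky factor of [[1, rho], [rho, 1]] to two independent normals
    n1 = z1
    n2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
    # The normal CDF maps each draw onto [0, 1]
    std_normal = NormalDist()
    return std_normal.cdf(n1), std_normal.cdf(n2)


def draw_gammas(rho, rng):
    """Threshold the uniforms using m_probabilities [0.3, 0.7] and [0.1, 0.9]."""
    x1, x2 = correlated_uniforms(rho, rng)
    gamma_col_1 = 1 if x1 > 0.3 else 0
    gamma_col_2 = 1 if x2 > 0.1 else 0
    return gamma_col_1, gamma_col_2


rng = random.Random(42)
sample = [draw_gammas(0.5, rng) for _ in range(20000)]
share_col_1 = sum(g1 for g1, _ in sample) / len(sample)
share_col_2 = sum(g2 for _, g2 in sample) / len(sample)
print(f"share of 1s: col_1={share_col_1:.2f}, col_2={share_col_2:.2f}")
```

Because the CDF transform leaves each $x$ uniformly distributed, the share of 1s in each column should stay close to the specified 0.7 and 0.9 whatever the value of `rho`; only the dependence between the columns changes.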

This generalises easily to the case where there are more than two levels (the gamma column can contain more than two values).
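One way to picture the multi-level case is to cut $[0,1]$ into one interval per level, with widths equal to the level probabilities, and look up which interval $x$ falls into. The helper below is a hypothetical sketch of that idea, not the library's implementation, and the three-level probabilities are made up for illustration.

```python
from bisect import bisect_left
from itertools import accumulate


def uniform_to_level(x, probabilities):
    """Map x in [0, 1] to a gamma level, given the probability of each level."""
    # Upper edges of every interval except the last, e.g. [0.2, 0.5] for [0.2, 0.3, 0.5]
    thresholds = list(accumulate(probabilities))[:-1]
    # bisect_left counts the thresholds strictly below x,
    # so an x exactly on an edge stays in the lower level
    return bisect_left(thresholds, x)


three_levels = [0.2, 0.3, 0.5]  # illustrative probabilities for levels 0, 1 and 2
levels = [uniform_to_level(x, three_levels) for x in (0.1, 0.4, 0.9)]
print(levels)  # [0, 1, 2]
```

With two levels this reduces to the rule above: `uniform_to_level(x, [0.3, 0.7])` returns 1 exactly when `x > 0.3`.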

## What sort of correlations do we observe in generated data?

In [17]:
from splink_data_generation.generate_data_random import generate_df_gammas_random

from copy import deepcopy
settings = {
    "link_type": "dedupe_only",
    "proportion_of_matches" :0.5,
    "comparison_columns": [
        {
            "col_name": "col_1",
            "m_probabilities": [0.1, 0.9],
            "u_probabilities": [0.8, 0.2],
        },
        {
            "col_name": "col_2",
            "m_probabilities": [0.1, 0.9],
            "u_probabilities": [0.8, 0.2],
        }
    ],

}

In [18]:
from IPython.display import display, Markdown
def print_correlations(df):

    gamma_cols = [c for c in df.columns if 'gamma_' in c]
    
    df_m = df[df["true_match_l"]==1]
    display(Markdown("Correlations amongst true matches:"))
    display(df_m[gamma_cols].corr())
    df_u = df[df["true_match_l"]==0]
    display(Markdown("Correlations amongst true non-matches:"))
    display(df_u[gamma_cols].corr())

In [22]:
df_no_cov = generate_df_gammas_random(50000, deepcopy(settings), cov_m =[[1,0], [0,1]], cov_u =[[1,0], [0,1]])
print_correlations(df_no_cov)

Correlations amongst true matches:

| | gamma_col_1 | gamma_col_2 |
|---|---|---|
| gamma_col_1 | 1.0 | 0.009036 |
| gamma_col_2 | 0.009036 | 1.0 |


Correlations amongst true non-matches:

| | gamma_col_1 | gamma_col_2 |
|---|---|---|
| gamma_col_1 | 1.0 | 0.016403 |
| gamma_col_2 | 0.016403 | 1.0 |


In [23]:
df_cov = generate_df_gammas_random(50000, deepcopy(settings), cov_m=[[1, 0.5], [0.5, 1]], cov_u=[[1, -0.5], [-0.5, 1]])
print_correlations(df_cov)

Correlations amongst true matches:

| | gamma_col_1 | gamma_col_2 |
|---|---|---|
| gamma_col_1 | 1.0 | 0.264824 |
| gamma_col_2 | 0.264824 | 1.0 |


Correlations amongst true non-matches:

| | gamma_col_1 | gamma_col_2 |
|---|---|---|
| gamma_col_1 | 1.0 | -0.194135 |
| gamma_col_2 | -0.194135 | 1.0 |
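The gap between the specified 0.5 and the observed 0.26 is what we would expect when correlated normal draws are reduced to 0/1 values: thresholding discards information, so the correlation between the binary columns is weaker than the correlation between the underlying draws. The standalone simulation below illustrates this with the standard library only. It is a sketch that assumes unit variances and the `[0.1, 0.9]` match probabilities used above, and the `pearson` helper is written out here for illustration rather than taken from any package.

```python
import math
import random
from statistics import NormalDist


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rho = 0.5  # off-diagonal entry of cov_m
# The event cdf(n) > 0.1 is the same as n > inverse_cdf(0.1)
cutoff = NormalDist().inv_cdf(0.1)

rng = random.Random(0)
normal_1, normal_2, gamma_1, gamma_2 = [], [], [], []
for _ in range(50000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    n1 = z1
    n2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
    normal_1.append(n1)
    normal_2.append(n2)
    gamma_1.append(1 if n1 > cutoff else 0)
    gamma_2.append(1 if n2 > cutoff else 0)

latent_corr = pearson(normal_1, normal_2)
binary_corr = pearson(gamma_1, gamma_2)
print(f"normal draws: {latent_corr:.2f}, gamma values: {binary_corr:.2f}")
```

The correlation of the normal draws should sit close to 0.5, while the correlation of the gamma values comes out noticeably lower, in line with the table above.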


## What effect do these correlations have on parameter estimates?

Consider the following settings.  We know from the first notebook that Splink is able to estimate these parameters correctly if there is no correlation structure in the data.

What happens if we introduce fairly strong correlations?

In [25]:
settings = {
    "proportion_of_matches": 0.2,
    "link_type": "dedupe_only",
    "comparison_columns": [
        {
            "col_name": "col_1",
            "m_probabilities": [0.3, 0.7],  # Probability of typo
            "u_probabilities": [0.9, 0.1],  # Probability of collision
        },
        {
            "col_name": "col_2",
            "m_probabilities": [0.1, 0.9],  # Probability of typo
            "u_probabilities": [0.975, 0.025],  # Probability of collision
        },
        {
            "col_name": "col_3",
            "m_probabilities": [0.05, 0.95],  # Probability of typo
            "u_probabilities": [0.8, 0.2],  # Probability of collision
        },
    ],
     "additional_columns_to_retain": [
        "true_match", "true_match_probability"
    ]
}

In [27]:
cov_m = [
    [1,0.8,0.8],
    [0.8,1,0.8],
    [0.8,0.8,1],
]

cov_u = [
    [1,0.8,0.8],
    [0.8,1,0.8],
    [0.8,0.8,1],
]

df = generate_df_gammas_random(50000, deepcopy(settings), cov_m =cov_m, cov_u =cov_u)

In [35]:
print_correlations(df)

Correlations amongst true matches:

| | gamma_col_1 | gamma_col_2 | gamma_col_3 |
|---|---|---|---|
| gamma_col_1 | 1.0 | 0.451579 | 0.319108 |
| gamma_col_2 | 0.451579 | 1.0 | 0.46677 |
| gamma_col_3 | 0.319108 | 0.46677 | 1.0 |


Correlations amongst true non-matches:

| | gamma_col_1 | gamma_col_2 | gamma_col_3 |
|---|---|---|---|
| gamma_col_1 | 1.0 | 0.376827 | 0.497911 |
| gamma_col_2 | 0.376827 | 1.0 | 0.292526 |
| gamma_col_3 | 0.497911 | 0.292526 | 1.0 |


In [29]:
from IPython.display import display, Markdown

import numpy as np

from splink_data_generation.match_prob import add_match_prob
from splink_data_generation.log_likelihood import add_log_likelihood

df = add_match_prob(df, settings)
df = add_log_likelihood(df, settings)

binary_prop = df["true_match_l"].mean()
prob_prop = df["true_match_probability_l"].mean()
log_likelihood = sum(df["true_log_likelihood_l"])

md = f"""
The number of rows in the simulated dataset is {len(df):,.0f}

The proportion of matches according to `true_match` status is {binary_prop:,.4f}

The expected proportion of matches according to `true_match_probability` status is {prob_prop:,.4f}

The log likelihood of the dataset given the true parameters is {log_likelihood:,.2f}
"""

display(Markdown(md))


The number of rows in the simulated dataset is 50,000

The proportion of matches according to `true_match` status is 0.2000

The expected proportion of matches according to `true_match_probability` status is 0.2214

The log likelihood of the dataset given the true parameters is -62,717.50
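For reference, the 'true' match probability of a comparison vector under the conditional independence assumption follows from Bayes' rule applied to the m and u probabilities in `settings`. The sketch below is an assumption about what `add_match_prob` computes rather than a copy of its code, and `true_match_probability` is a hypothetical helper. The two values it prints do agree with the `true_match_probability_l` column in the estimation output further down this notebook.

```python
from math import prod

lam = 0.2  # proportion_of_matches from the settings
m = {"col_1": [0.3, 0.7], "col_2": [0.1, 0.9], "col_3": [0.05, 0.95]}
u = {"col_1": [0.9, 0.1], "col_2": [0.975, 0.025], "col_3": [0.8, 0.2]}


def true_match_probability(gammas):
    """P(match | gammas), treating the columns as independent given match status."""
    p_match = lam * prod(m[col][g] for col, g in gammas.items())
    p_non_match = (1 - lam) * prod(u[col][g] for col, g in gammas.items())
    return p_match / (p_match + p_non_match)


all_agree = true_match_probability({"col_1": 1, "col_2": 1, "col_3": 1})
col_1_differs = true_match_probability({"col_1": 0, "col_2": 1, "col_3": 1})
print(round(all_agree, 6))      # 0.996669
print(round(col_1_differs, 6))  # 0.934426
```

Because this formula multiplies the per-column probabilities, it is only exact when the columns really are independent given match status. That is the assumption the correlated data deliberately violates, which is presumably why the expected proportion of matches (0.2214) drifts away from the actual proportion (0.2000).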


In [34]:
f_match = df["true_match_l"] == 1
f_non_match = df["true_match_l"] == 0

m_col_1 = df[f_match]["gamma_col_1"].mean()
u_col_1 = df[f_non_match]["gamma_col_1"].mean()

m_col_2 = df[f_match]["gamma_col_2"].mean()
u_col_2 = df[f_non_match]["gamma_col_2"].mean()

m_col_3 = df[f_match]["gamma_col_3"].mean()
u_col_3 = df[f_non_match]["gamma_col_3"].mean()

md = f"""
**For gamma_col_1:**

m probabilities: [{1- m_col_1:,.4f},{m_col_1:,.4f}]

u probabilities: [{1- u_col_1:,.4f},{u_col_1:,.4f}]

**For gamma_col_2:**

m probabilities: [{1- m_col_2:,.4f},{m_col_2:,.4f}]

u probabilities: [{1- u_col_2:,.4f},{u_col_2:,.4f}]


**For gamma_col_3:**

m probabilities: [{1- m_col_3:,.4f},{m_col_3:,.4f}]

u probabilities: [{1- u_col_3:,.4f},{u_col_3:,.4f}]

"""

display(Markdown(md))


**For gamma_col_1:**

m probabilities: [0.3045,0.6955]

u probabilities: [0.8973,0.1027]

**For gamma_col_2:**

m probabilities: [0.1026,0.8974]

u probabilities: [0.9753,0.0247]


**For gamma_col_3:**

m probabilities: [0.0475,0.9525]

u probabilities: [0.7985,0.2015]



In [36]:
import logging 
logging.basicConfig()  # Means logs will print in Jupyter Lab

# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood
logging.getLogger("splink").setLevel(logging.INFO)

from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

In [38]:
settings_2 = {

    "link_type": "dedupe_only",
    "comparison_columns": [
        {
            "col_name": "col_1",
        },
        {
            "col_name": "col_2",
        },
        {
            "col_name": "col_3",
        },
    ],
     "additional_columns_to_retain": [
        "true_match", "true_match_probability"
    ]
}

from splink_data_generation.estimate_splink import estimate
df_e, linker = estimate(df, deepcopy(settings_2) ,spark)
df_e.toPandas().head(5)

INFO:splink.expectation_step:Log likelihood for iteration 0:  -66771.64673389221
INFO:splink.iterate:Iteration 0 complete
INFO:splink.params:The maximum change in parameters was 0.09864112138748171 for key π_gamma_col_2_prob_dist_match_level_1_probability
INFO:splink.expectation_step:Log likelihood for iteration 1:  -60075.97787245271
INFO:splink.iterate:Iteration 1 complete
INFO:splink.params:The maximum change in parameters was 0.02306593954563141 for key π_gamma_col_1_prob_dist_match_level_0_probability
INFO:splink.expectation_step:Log likelihood for iteration 2:  -59733.09941977019
INFO:splink.iterate:Iteration 2 complete
INFO:splink.params:The maximum change in parameters was 0.012529104948043823 for key π_gamma_col_1_prob_dist_match_level_0_probability
INFO:splink.expectation_step:Log likelihood for iteration 3:  -59676.989747042055
INFO:splink.iterate:Iteration 3 complete
INFO:splink.params:The maximum change in parameters was 0.006862357258796692 for key π_gamma_col_2_prob_dist

| | match_probability | unique_id_l | unique_id_r | gamma_col_1 | prob_gamma_col_1_non_match | prob_gamma_col_1_match | gamma_col_2 | prob_gamma_col_2_non_match | prob_gamma_col_2_match | gamma_col_3 | prob_gamma_col_3_non_match | prob_gamma_col_3_match | true_match_l | true_match_r | true_match_probability_l | true_match_probability_r |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.999933 | dd975959 | b1fe638b | 1 | 0.025434 | 0.783357 | 1 | 0.004243 | 0.759035 | 1 | 0.1276 | 0.995131 | 1 | 1 | 0.996669 | 0.996669 |
| 1 | 0.999933 | 47d28ac7 | bbc3fd6c | 1 | 0.025434 | 0.783357 | 1 | 0.004243 | 0.759035 | 1 | 0.1276 | 0.995131 | 1 | 1 | 0.996669 | 0.996669 |
| 2 | 0.990828 | 90620010 | 3fc7a7fd | 0 | 0.974566 | 0.216643 | 1 | 0.004243 | 0.759035 | 1 | 0.1276 | 0.995131 | 1 | 1 | 0.934426 | 0.934426 |
| 3 | 0.999933 | 2b6659f8 | 866805f0 | 1 | 0.025434 | 0.783357 | 1 | 0.004243 | 0.759035 | 1 | 0.1276 | 0.995131 | 1 | 1 | 0.996669 | 0.996669 |
| 4 | 0.999933 | a1a3ced2 | f636f1af | 1 | 0.025434 | 0.783357 | 1 | 0.004243 | 0.759035 | 1 | 0.1276 | 0.995131 | 1 | 1 | 0.996669 | 0.996669 |


In [39]:
linker.model

λ (proportion of matches) = 0.25831910967826843
------------------------------------
gamma_col_1: Comparison of col_1

Probability distribution of gamma values amongst matches:
    value 0: 0.216643 (level represents lowest category of string similarity)
    value 1: 0.783357 (level represents highest category of string similarity)

Probability distribution of gamma values amongst non-matches:
    value 0: 0.974566 (level represents lowest category of string similarity)
    value 1: 0.025434 (level represents highest category of string similarity)
------------------------------------
gamma_col_2: Comparison of col_2

Probability distribution of gamma values amongst matches:
    value 0: 0.240965 (level represents lowest category of string similarity)
    value 1: 0.759035 (level represents highest category of string similarity)

Probability distribution of gamma values amongst non-matches:
    value 0: 0.995757 (level represents lowest category of string similarity)
    value 1: 0.0042

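The estimation output retains `true_match_probability`, which allows us to compare the probabilities estimated by Splink with the 'true' probabilities. We make the comparison on the log odds scale.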

In [73]:
df_e_pd = df_e.toPandas()
df_e_pd = df_e_pd[["match_probability", "true_match_probability_l", "true_match_l"]]
df_e_pd["ln_est_match_prob"] = np.log(df_e_pd["match_probability"]/(1-df_e_pd["match_probability"]))
df_e_pd["ln_true_match_prob"] =np.log(df_e_pd["true_match_probability_l"]/(1-df_e_pd["true_match_probability_l"]))
df_e_pd

| | match_probability | true_match_probability_l | true_match_l | ln_est_match_prob | ln_true_match_prob |
|---|---|---|---|---|---|
| 0 | 0.999933 | 0.996669 | 1 | 9.613638 | 5.701279 |
| 1 | 0.999933 | 0.996669 | 1 | 9.613638 | 5.701279 |
| 2 | 0.990828 | 0.934426 | 1 | 4.682389 | 2.656757 |
| 3 | 0.999933 | 0.996669 | 1 | 9.613638 | 5.701279 |
| 4 | 0.999933 | 0.996669 | 1 | 9.613638 | 5.701279 |
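To see where the gap between 9.61 and 5.70 comes from, the log odds for a row in which all three columns agree can be split into a prior term, $\ln(\lambda/(1-\lambda))$, plus one $\ln(m/u)$ term per column. The sketch below does this using the true parameters from `settings` and the estimated values printed above (rounded as displayed); the `log_odds` helper is hypothetical.

```python
from math import log


def log_odds(lam, m_agree, u_agree):
    """Return the prior log odds and the per-column log Bayes factors for agreement."""
    prior = log(lam / (1 - lam))
    weights = [log(m / u) for m, u in zip(m_agree, u_agree)]
    return prior, weights


# P(gamma = 1) amongst matches (m) and non-matches (u) for col_1, col_2, col_3
true_prior, true_weights = log_odds(0.2, [0.7, 0.9, 0.95], [0.1, 0.025, 0.2])
est_prior, est_weights = log_odds(
    0.258319, [0.783357, 0.759035, 0.995131], [0.025434, 0.004243, 0.1276]
)

true_total = true_prior + sum(true_weights)
est_total = est_prior + sum(est_weights)
print(f"{true_total:.2f} {est_total:.2f}")  # 5.70 9.61

for col, t, e in zip(["col_1", "col_2", "col_3"], true_weights, est_weights):
    print(f"{col}: true weight {t:.2f}, estimated weight {e:.2f}")
```

Every estimated agreement weight is larger than its true counterpart. One plausible reading is that, when columns tend to agree together, a model that assumes independence counts the same evidence more than once.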


In [80]:
len(df_e_pd)

50000

In [85]:
import altair as alt

alt.data_transformers.enable('data_server')

alt.Chart(df_e_pd).mark_circle().encode(
    y='ln_est_match_prob',
    x='ln_true_match_prob',
    tooltip=['ln_est_match_prob', 'ln_true_match_prob', 'true_match_l']
)

We see that, whilst a higher true match probability generally leads to a higher estimated match probability, the relationship is not monotonic.
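With three binary columns there are only eight possible gamma patterns, so the lack of monotonicity can be checked directly by scoring every pattern under both parameter sets and looking for pairs whose order flips. The sketch below uses the parameter values quoted earlier in this notebook (estimates rounded as displayed) and assumes that the probability of level 0 is one minus the probability of level 1. The `log_odds` helper is hypothetical.

```python
from itertools import product
from math import log

# P(gamma = 1) for col_1, col_2, col_3 amongst matches (m) and non-matches (u)
TRUE = {"lam": 0.2, "m": [0.7, 0.9, 0.95], "u": [0.1, 0.025, 0.2]}
ESTIMATED = {
    "lam": 0.258319,
    "m": [0.783357, 0.759035, 0.995131],
    "u": [0.025434, 0.004243, 0.1276],
}


def log_odds(params, pattern):
    """Log odds of a match for a tuple of 0/1 gamma values."""
    total = log(params["lam"] / (1 - params["lam"]))
    for gamma, m1, u1 in zip(pattern, params["m"], params["u"]):
        m, u = (m1, u1) if gamma == 1 else (1 - m1, 1 - u1)
        total += log(m / u)
    return total


patterns = list(product([0, 1], repeat=3))
# Pairs ranked one way by the true parameters and the other way by the estimates
inversions = [
    (a, b)
    for a in patterns
    for b in patterns
    if log_odds(TRUE, a) < log_odds(TRUE, b)
    and log_odds(ESTIMATED, a) > log_odds(ESTIMATED, b)
]
for a, b in inversions:
    print(f"{a} ranks below {b} under the true parameters but above it under the estimates")
```

For example, agreement on `col_3` alone is weaker evidence than agreement on `col_2` alone under the true parameters, yet the estimated parameters rank the two the other way round.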