## Estimating m from a sample of pairwise labels

In this example, we estimate the m probabilities of the model from a table of pairwise record comparisons that are known to be true matches. Such a table might come from a clerical team that has manually labelled a sample of record pairs.

The table must be in the following format:

| source_dataset_l | unique_id_l | source_dataset_r | unique_id_r |
| ---------------- | ----------- | ---------------- | ----------- |
| df_1             | 1           | df_2             | 2           |
| df_1             | 1           | df_2             | 3           |

Every row in the table is assumed to represent a certain match between the two records it references.

Note that the column names above are the defaults. They should correspond to the values you've set for [`unique_id_column_name`](https://moj-analytical-services.github.io/splink/settings_dict_guide.html#unique_id_column_name) and [`source_dataset_column_name`](https://moj-analytical-services.github.io/splink/settings_dict_guide.html#source_dataset_column_name), if you've chosen custom values.
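
As a minimal sketch of how custom settings translate into column names, the helper below derives the four expected columns by appending `_l` and `_r` to the configured names. `expected_label_columns` is a hypothetical function written for illustration and is not part of Splink; the custom names `person_id` and `origin` are likewise made up.

```python
def expected_label_columns(
    unique_id_column_name="unique_id",
    source_dataset_column_name="source_dataset",
):
    """Return the column names a pairwise labels table would need.

    Hypothetical helper for illustration only; not a Splink function.
    Assumes the `_l` / `_r` suffixes are appended to the configured names,
    as they are for the default names shown in the table above.
    """
    return [
        f"{source_dataset_column_name}_l",
        f"{unique_id_column_name}_l",
        f"{source_dataset_column_name}_r",
        f"{unique_id_column_name}_r",
    ]


# Default settings reproduce the header of the table above
default_columns = expected_label_columns()

# Made-up custom settings, to show how the names would change
custom_columns = expected_label_columns("person_id", "origin")

print(default_columns)
print(custom_columns)
```

Checking a labels table against a list like this before registering it is one way to catch a mismatch between the table and the settings early.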


<a target="_blank" href="https://colab.research.google.com/github/moj-analytical-services/splink/blob/splink4_dev/docs/demos/examples/duckdb/pairwise_labels.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


In [1]:
# Uncomment and run this cell if you're running in Google Colab.
# !pip install git+https://github.com/moj-analytical-services/splink.git@splink4_dev

In [2]:
from splink.datasets import splink_dataset_labels

pairwise_labels = splink_dataset_labels.fake_1000_labels

# Keep only the labels that indicate a match
pairwise_labels = pairwise_labels[pairwise_labels["clerical_match_score"] == 1]
pairwise_labels

Unnamed: 0,unique_id_l,source_dataset_l,unique_id_r,source_dataset_r,clerical_match_score
0,0,fake_1000,1,fake_1000,1.0
1,0,fake_1000,2,fake_1000,1.0
2,0,fake_1000,3,fake_1000,1.0
49,1,fake_1000,2,fake_1000,1.0
50,1,fake_1000,3,fake_1000,1.0
...,...,...,...,...,...
3171,994,fake_1000,996,fake_1000,1.0
3172,995,fake_1000,996,fake_1000,1.0
3173,997,fake_1000,998,fake_1000,1.0
3174,997,fake_1000,999,fake_1000,1.0


We now proceed to estimate the Fellegi-Sunter model:


In [3]:
from splink import splink_datasets

df = splink_datasets.fake_1000
df.head(2)

Unnamed: 0,unique_id,first_name,surname,dob,city,email,cluster
0,0,Robert,Alan,1971-06-24,,robert255@smith.net,0
1,1,Robert,Allen,1971-05-24,,roberta25@smith.net,0


In [4]:
from splink import Linker, DuckDBAPI, block_on, SettingsCreator
import splink.comparison_library as cl
import splink.comparison_template_library as ctl


settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    comparisons=[
        ctl.NameComparison("first_name"),
        ctl.NameComparison("surname"),
        ctl.DateComparison(
            "dob",
            input_is_string=True,
            invalid_dates_as_null=True,
            datetime_metrics=["month", "year", "year"],
            datetime_thresholds=[1, 1, 10],
        ),
        cl.LevenshteinAtThresholds("dob", [2]),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        ctl.EmailComparison("email", include_username_fuzzy_level=False),
    ],
    retain_intermediate_calculation_columns=True
)



In [5]:
linker = Linker(df, settings, database_api=DuckDBAPI(), set_up_basic_logging=False)
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email",
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
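
The arithmetic behind this kind of estimate can be sketched as follows. This is an illustration of the general idea under stated assumptions, not Splink's implementation: the number of pairs found by the deterministic rules is scaled up by the assumed recall, then divided by the number of possible pairwise comparisons in a deduplication job. The function name and the figure of 1,400 pairs are hypothetical.

```python
def probability_two_random_records_match(n_records, n_pairs_found, recall):
    """Rough prior that two randomly chosen records match.

    Illustrative sketch only; assumes a dedupe_only job, where n records
    give n * (n - 1) / 2 possible pairwise comparisons.
    """
    total_comparisons = n_records * (n_records - 1) / 2
    # The rules are assumed to find only a `recall` share of all true matches,
    # so scale the observed count up to estimate the true number of matches
    estimated_true_matches = n_pairs_found / recall
    return estimated_true_matches / total_comparisons


# Hypothetical numbers: 1,000 records, 1,400 pairs found by the rules
prior = probability_two_random_records_match(
    n_records=1000, n_pairs_found=1400, recall=0.7
)
print(prior)
```

A lower `recall` implies the rules miss more true matches, so the same number of matched pairs yields a higher estimated prior.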

In [6]:
linker.estimate_u_using_random_sampling(max_pairs=1e6)

In [7]:
# Register the pairwise labels table with the database, and then use it to estimate the m values
labels_df = linker.register_labels_table(pairwise_labels, overwrite=True)
linker.estimate_m_from_pairwise_labels(labels_df)


# If the labels table already exists in the database, you could instead run
# linker.estimate_m_from_pairwise_labels("labels_tablename_here")
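
To make the idea concrete, here is a small, self-contained sketch of what estimating m from labelled matches amounts to: for each comparison, the m probability of a level is the share of known matches that fall into that level. The records, labels, and the two-level comparison below are hypothetical toy data, and excluding null comparisons from the denominator is an assumption of this sketch rather than a statement of how Splink computes its estimates.

```python
from collections import Counter

# Hypothetical records keyed by unique_id (toy data, not fake_1000)
records = {
    1: {"first_name": "robert", "city": "london"},
    2: {"first_name": "robert", "city": None},
    3: {"first_name": "rob", "city": "london"},
    4: {"first_name": "robert", "city": "leeds"},
}

# Labelled true matches as (unique_id_l, unique_id_r) pairs
labelled_matches = [(1, 2), (1, 3), (1, 4), (2, 4)]


def comparison_level(left, right):
    """Assign a pair of values to a simple exact-match / else comparison level."""
    if left is None or right is None:
        return "null"
    return "exact_match" if left == right else "else"


def estimate_m(column):
    """Share of labelled matches in each non-null comparison level."""
    levels = Counter(
        comparison_level(records[id_l][column], records[id_r][column])
        for id_l, id_r in labelled_matches
    )
    # Assumption of this sketch: null comparisons are left out of the estimate
    levels.pop("null", None)
    total = sum(levels.values())
    return {level: count / total for level, count in levels.items()}


m_first_name = estimate_m("first_name")
m_city = estimate_m("city")
print(m_first_name)
print(m_city)
```

Because the estimate uses only pairs labelled as matches, its quality depends on how representative the labelled sample is of all true matches.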

In [8]:
training_blocking_rule = block_on("first_name")
linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule)

<EMTrainingSession, blocking on l."first_name" = r."first_name", deactivating comparisons first_name>

In [9]:
linker.parameter_estimate_comparisons_chart()

In [10]:
linker.match_weights_chart()