# Specifying and estimating a linkage model


In the last tutorial we looked at how we can use blocking rules to generate pairwise record comparisons.

Now it's time to build a probabilistic linkage model to score each of these comparisons.

The resultant match score is a prediction of whether the two records represent the same entity (e.g. are the same person).  You can read more about the theory behind probabilistic linkage models [here](https://www.robinlinacre.com/intro_to_probabilistic_linkage/).




In [None]:
# Begin by reading in the tutorial data again
from splink.duckdb.duckdb_linker import DuckDBLinker
import pandas as pd 
import altair as alt
alt.renderers.enable("mimetype")
df = pd.read_csv("./data/fake_1000.csv")

##  Specifying a linkage model

To produce a match score, `splink` needs to know how to compare the information in our pairwise record comparisons.

To be concrete, here is an example comparison:


first_name_l|first_name_r|surname_l|surname_r|dob_l     |dob_r     |city_l|city_r|email_l            |email_r            |
------------|------------|---------|---------|----------|----------|------|------|-------------------|-------------------|
Robert      |Rob         |Allen    |Allen    |1971-05-24|1971-06-24|nan   |London|roberta25@smith.net|roberta25@smith.net|

What functions should we use to assess the similarity of `Rob` vs. `Robert` in the `first_name` field?  

Should similarity in the `dob` field be computed in the same way, or a different way?

Your job as the developer of a linkage model is to decide what comparisons are most appropriate for the types of data you have.  
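
One common choice for names is an edit distance such as Levenshtein, which the deterministic rules later in this tutorial also rely on. The sketch below is a plain-Python implementation for illustration only: the `levenshtein` helper is written here and is not part of Splink, where the SQL backend supplies the function. It is applied to the `first_name` and `dob` values from the example comparison above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Values taken from the example comparison above
name_distance = levenshtein("Rob", "Robert")
dob_distance = levenshtein("1971-05-24", "1971-06-24")
print(name_distance, dob_distance)
```

The same function gives quite different signals on a short name and on a fixed-width date, which is one reason thresholds are usually chosen per column.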

### Comparisons

The concept of a `Comparison` has a specific definition within Splink: it defines how data from one or more input columns is compared, using SQL expressions to assess similarity.

For example, one `Comparison` may represent how similarity is assessed for a person's date of birth.  



Another `Comparison` may represent the comparison of a person's name or location.



A model will thereby be composed of many `Comparison`s, which between them assess the similarity of all of the columns being used for data linking.  

Each `Comparison` contains two or more `ComparisonLevels` which define _n_ discrete gradations of similarity between the input columns within the Comparison.

As such, `ComparisonLevels` are nested within `Comparisons` as follows:

```
Data Linking Model
├─-- Comparison: Date of birth
│    ├─-- ComparisonLevel: Exact match
│    ├─-- ComparisonLevel: One character difference
│    ├─-- ComparisonLevel: All other
├─-- Comparison: Surname
│    ├─-- ComparisonLevel: Exact match on surname
│    ├─-- ComparisonLevel: All other
│    etc.
```

Our example data would therefore result in the following comparisons, for `dob` and `surname`:

|dob_l     |dob_r     |comparison_level|
|----------|----------|---------|
|1971-05-24|1971-06-24|One character difference|



surname_l|surname_r|comparison_level|
---------|---------|---------|
Allen    |Allen    |Exact match|
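
To make the idea of levels expressed as SQL concrete, here is a minimal sketch that assigns the `dob` levels from the tree above using a SQL `CASE` expression. It runs on Python's built-in `sqlite3` rather than on Splink, and `one_char_or_less` is a hypothetical helper registered for this example (it only counts substitutions between equal-length strings). The conditions are checked in order, so the first one that applies determines the level.

```python
import sqlite3


def one_char_or_less(a, b):
    """Return 1 if two equal-length strings differ in at most one position."""
    if a is None or b is None or len(a) != len(b):
        return 0
    return int(sum(x != y for x, y in zip(a, b)) <= 1)


connection = sqlite3.connect(":memory:")
# Expose the hypothetical helper to SQL under the same name
connection.create_function("one_char_or_less", 2, one_char_or_less)

DOB_LEVEL_SQL = """
    SELECT CASE
        WHEN :l = :r THEN 'Exact match'
        WHEN one_char_or_less(:l, :r) = 1 THEN 'One character difference'
        ELSE 'All other'
    END
"""


def dob_level(dob_l, dob_r):
    """Evaluate the CASE expression for one pair of dob values."""
    row = connection.execute(DOB_LEVEL_SQL, {"l": dob_l, "r": dob_r}).fetchone()
    return row[0]


print(dob_level("1971-05-24", "1971-06-24"))
```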

More information about comparisons can be found [here](https://moj-analytical-services.github.io/splink/comparison.html).


We will now use these concepts to build a data linking model.

### Specifying the model using comparisons

Splink includes libraries of comparison functions to make it simple to get started.

Let's start by looking at a `Comparison` for `first_name`:

In [None]:
import splink.duckdb.duckdb_comparison_library as cl

first_name_comparison = cl.levenshtein_at_thresholds("first_name", 2)
print(first_name_comparison.human_readable_description)


## Specifying the full settings dictionary

`Comparisons` are specified as part of the Splink `settings`, a Python dictionary which controls all of the configuration of a Splink model:

In [None]:
settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        cl.levenshtein_at_thresholds("first_name", 2),
        cl.levenshtein_at_thresholds("surname"),
        cl.levenshtein_at_thresholds("dob"),
        cl.exact_match("city", term_frequency_adjustments=True),
        cl.levenshtein_at_thresholds("email"),
    ],
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}

In words, this settings dictionary says:


* We are performing a `dedupe_only` link (the other options are `link_only` or `link_and_dedupe`, which may be used if there are multiple input datasets).
* When comparing records, we will use information from the `first_name`, `surname`, `dob`, `city` and `email` columns to compute a match score.
* The `blocking_rules_to_generate_predictions` states that we will only check for duplicates amongst records where either the `first_name` or `surname` is identical.
* We have enabled term frequency adjustments for the 'city' column, because some values (e.g. `London`) appear much more frequently than others
* We have set `retain_matching_columns` and `retain_intermediate_calculation_columns` to `True` so that Splink outputs additional information that helps the user understand the calculations. If they were `False`, the computations would run faster.
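
To see what the blocking rules in the settings imply, the sketch below applies the same "either `first_name` or `surname` is identical" condition to a handful of made-up records in plain Python. The records are hypothetical, and Splink evaluates blocking rules in SQL rather than with a Python loop; this only illustrates which pairs survive.

```python
from itertools import combinations

# Hypothetical toy records, not taken from fake_1000.csv
records = [
    {"id": 1, "first_name": "Robert", "surname": "Allen"},
    {"id": 2, "first_name": "Rob", "surname": "Allen"},
    {"id": 3, "first_name": "Robert", "surname": "Smith"},
    {"id": 4, "first_name": "Julia", "surname": "Taylor"},
    {"id": 5, "first_name": "Julia", "surname": "Jones"},
]

# Without blocking, every pair of records would be compared
all_pairs = list(combinations(records, 2))

# With blocking, only pairs that satisfy at least one rule are scored
blocked_pairs = [
    (left["id"], right["id"])
    for left, right in all_pairs
    if left["first_name"] == right["first_name"]
    or left["surname"] == right["surname"]
]

print(len(all_pairs), "pairs in total,", len(blocked_pairs), "after blocking")
print(blocked_pairs)
```

Pairs that satisfy neither rule, such as records 2 and 3 here, are never scored, so blocking rules trade some recall for a large reduction in work.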

## Estimate the parameters of the model

Now that we have specified our linkage model, we want to estimate its `m` and `u` parameters, and the [`probability_two_random_records_match`](https://moj-analytical-services.github.io/splink/settings_dict_guide.html#probability_two_random_records_match) parameter.

- The `m` values are the proportion of records falling into each `ComparisonLevel` amongst truly *matching* records

- The `u` values are the proportion of records falling into each `ComparisonLevel` amongst truly *non-matching* records

- The `probability_two_random_records_match` parameter is the probability that two records taken at random from your input data represent a match (typically a very small number).

You can read more about the theory of what these mean [here](https://www.robinlinacre.com/maths_of_fellegi_sunter/).
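
As a rough sketch of how these three parameters combine in the Fellegi-Sunter model, each observed `ComparisonLevel` contributes a match weight of log2(m/u), and these are added to a prior weight derived from `probability_two_random_records_match`. The `m`, `u` and prior values below are hypothetical, chosen only to make the arithmetic easy to follow, and the comparisons are assumed to be independent of each other.

```python
import math


def match_weight(m, u):
    """Match weight of one comparison level: log2 of the Bayes factor m/u."""
    return math.log2(m / u)


def prior_match_weight(probability_two_random_records_match):
    """Prior odds of a match, expressed as a log2 weight."""
    p = probability_two_random_records_match
    return math.log2(p / (1 - p))


def match_probability(total_weight):
    """Convert a total match weight back into a probability."""
    bayes_factor = 2 ** total_weight
    return bayes_factor / (1 + bayes_factor)


# Hypothetical parameter values for the levels observed in one record pair
prior = 0.001
observed_levels = {
    "first_name: exact match": (0.8, 0.1),        # (m, u)
    "dob: one character difference": (0.1, 0.01),  # (m, u)
}

total_weight = prior_match_weight(prior) + sum(
    match_weight(m, u) for m, u in observed_levels.values()
)
print(round(total_weight, 2), round(match_probability(total_weight), 4))
```

Levels where `m` is much larger than `u` push the weight up, and levels where `u` is larger than `m` push it down.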


In general, we can estimate the `u` parameters and the `probability_two_random_records_match` using direct estimation techniques, even if we do not have labelled data.

If we do not have labelled data, the `m` parameters can be estimated using an iterative maximum likelihood approach called Expectation Maximisation.  If we have labels, we can directly estimate the `m` parameters.



In [None]:
linker = DuckDBLinker(df, settings)

### Direct estimation of `probability_two_random_records_match`

In some cases, the `probability_two_random_records_match` will be known. For example, if you are linking two tables of 10,000 records and expect a one-to-one match, then you should set this value to `1/10_000` in your settings instead of estimating it.

More generally, this parameter is unknown and needs to be estimated.  

It can be estimated accurately enough for most purposes by combining a series of deterministic matching rules and a guess of the recall corresponding to those rules.  For further details of the rationale behind this approach see [here](https://github.com/moj-analytical-services/splink/issues/462#issuecomment-1227027995).

In this example, I guess that the following deterministic matching rules have a recall of about 70%.

In [None]:
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email"
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
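
The arithmetic behind this kind of estimate can be sketched as follows, on the assumption that the approach works roughly as described in the linked issue: count the pairs found by the deterministic rules, scale up by the guessed recall, and divide by the total number of possible pairs. The function name and the count of 1,400 pairs are hypothetical and are not output from Splink.

```python
def sketch_probability_two_random_records_match(n_records, pairs_found_by_rules, recall):
    """Rough estimate for a dedupe_only job on a single table of n_records."""
    # Every unordered pair of distinct records is a possible comparison
    total_pairs = n_records * (n_records - 1) // 2
    # The rules only find a fraction (the recall) of the true matches
    estimated_true_matches = pairs_found_by_rules / recall
    return estimated_true_matches / total_pairs


# Hypothetical: the rules find 1,400 pairs among 1,000 records, at 70% recall
estimate = sketch_probability_two_random_records_match(
    n_records=1000, pairs_found_by_rules=1400, recall=0.7
)
print(estimate)
```

A lower recall guess implies more true matches than the rules found, and therefore a higher estimate.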

### Direct estimation of `u` probabilities

Next, we will use the `estimate_u_using_random_sampling` method to compute the `u` values of the model.   

The larger the random sample, the more accurate the predictions.  You control the sample size using the `max_pairs` parameter. For large datasets, we recommend using at least 10 million pairs; the higher the better, and 1 billion is often appropriate.

In [None]:
linker.estimate_u_using_random_sampling(max_pairs=1e6)
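
The intuition behind random sampling is that two records drawn at random are almost never the same entity, so the share of random pairs falling into each `ComparisonLevel` approximates `u`. The sketch below illustrates this on a made-up `city` column in plain Python; the data, the `sample_u` helper and the sample size are all hypothetical, and this is not how Splink implements the method.

```python
import random
from collections import Counter

# Hypothetical city column: some values are far more common than others
cities = ["London"] * 50 + ["Leeds"] * 30 + ["Bath"] * 20


def sample_u(values, n_pairs, seed=0):
    """Share of randomly sampled record pairs falling into each level."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_pairs):
        # Draw two distinct records at random
        i, j = rng.sample(range(len(values)), 2)
        level = "exact_match" if values[i] == values[j] else "all_other"
        counts[level] += 1
    return {level: count / n_pairs for level, count in counts.items()}


u_city = sample_u(cities, n_pairs=10_000)
print(u_city)
```

Increasing `n_pairs` reduces the sampling noise in each proportion, which matters most for levels that occur rarely.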

We then use the expectation maximisation algorithm to train the `m` values.

This algorithm estimates the `m` values by generating pairwise record comparisons, and using them to maximise a likelihood function. 

Each estimation pass requires the user to configure an estimation blocking rule to reduce the number of record comparisons generated to a manageable level.


In our first estimation pass, we block on `first_name` and `surname`, meaning we will generate all record comparisons that have `first_name` and `surname` exactly equal.   

Recall we are trying to estimate the `m` values of the model, i.e. proportion of records falling into each `ComparisonLevel` amongst truly matching records.

This means that, in this training session, we cannot estimate parameters for the `first_name` or `surname` columns, since we have forced them to be equal 100% of the time.

We can, however, estimate parameters for all of the other columns.  The output messages produced by Splink confirm this.

In [None]:
training_blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
training_session_fname_sname = linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule)
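
For readers who want to see the shape of the algorithm, here is a heavily simplified sketch of expectation maximisation for `m` values. It is not Splink's implementation: each comparison has only two levels (agree or disagree), the `u` values are held fixed, and the record pairs are synthetic, generated from known parameters so the result can be checked. All names and numbers are assumptions made for this example.

```python
import random


def em_estimate_m(patterns, u, m_init, lam_init, iterations=100):
    """patterns: list of tuples of 0/1 agreement flags, one per comparison.
    Returns estimated m values and the estimated share of matching pairs."""
    m, lam = list(m_init), lam_init
    for _ in range(iterations):
        # E-step: probability each pair is a match under current parameters
        match_probs = []
        for pattern in patterns:
            like_match, like_non_match = lam, 1 - lam
            for k, agrees in enumerate(pattern):
                like_match *= m[k] if agrees else 1 - m[k]
                like_non_match *= u[k] if agrees else 1 - u[k]
            match_probs.append(like_match / (like_match + like_non_match))
        # M-step: re-estimate parameters as match-probability-weighted shares
        total = sum(match_probs)
        lam = total / len(patterns)
        m = [
            sum(p for p, pattern in zip(match_probs, patterns) if pattern[k]) / total
            for k in range(len(m))
        ]
    return m, lam


# Synthetic pairs generated from known (hypothetical) parameters
rng = random.Random(42)
true_m, u, true_lam = [0.9, 0.8, 0.85], [0.05, 0.1, 0.02], 0.3
patterns = []
for _ in range(5000):
    level_probs = true_m if rng.random() < true_lam else u
    patterns.append(tuple(int(rng.random() < p) for p in level_probs))

estimated_m, estimated_lam = em_estimate_m(
    patterns, u, m_init=[0.7, 0.7, 0.7], lam_init=0.5
)
print([round(value, 2) for value in estimated_m], round(estimated_lam, 2))
```

If one of the agreement flags were 1 for every pair, as happens for a column used in the blocking rule, its `m` estimate would be 1 regardless of the data, which is why a second pass with a different blocking rule is needed.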

In a second estimation pass, we block on `dob`. This allows us to estimate parameters for the `first_name` and `surname` comparisons.

Between the two estimation passes, we now have parameter estimates for all comparisons.

In [None]:
# Second pass: block on dob so that first_name and surname parameters can be estimated
training_blocking_rule = "l.dob = r.dob"
training_session_dob = linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule)

Note that Splink includes other algorithms for estimating `m` and `u` values, which are documented [here](https://moj-analytical-services.github.io/splink/linkerest.html).

## Visualising model parameters

Splink can generate a number of charts to help you understand your model.  For an introduction to these charts and how to interpret them, please see [this](https://www.youtube.com/watch?v=msz3T741KQI&t=507s) video.

The final estimated match weights can be viewed in the match weights chart:

In [None]:
linker.match_weights_chart()

In [None]:
linker.m_u_parameters_chart()

### Saving the model

Finally we can save the model, including our estimated parameters, to a `.json` file, so we can use it in the next tutorial.

In [None]:
linker.save_settings_to_json("./demo_settings/saved_model_from_demo.json", overwrite=True)

## Next steps

Now we have trained a model, we can move on to using it to predict matching records.


## Further reading

Full documentation for all of the ways of estimating model parameters can be found [here](https://moj-analytical-services.github.io/splink/linkerest.html).