# Optimising configuration and best practices

It is rare that Splink's default settings will provide the best data linking results.  Significant improvements can usually be made with careful attention to data cleaning, comparison functions, and other configuration options.  

This notebook contains advice about best practice to help users get more accurate data linkage results. It is based on experience of optimising real-world jobs.

It focusses on introducing concepts and building intuition instead of providing extensive code or exhaustive details of every config option.  Please see the other examples in this repository for fully-working examples of code, and [here](https://moj-analytical-services.github.io/splink_settings_editor/) for a full list of configuration options. 

## Concepts

The [EM algorithm](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) that Splink uses to estimate model parameters is an unsupervised machine learning algorithm.  Specifically, it learns an optimal set of matching weights from the dataset of record comparisons.  

This gives us a framework for thinking about how best to use Splink because it means we can dip into the standard set of tools and ways of thinking that we use for any machine learning problem.
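To build intuition for what the EM algorithm is doing, the following is a minimal sketch for a Fellegi-Sunter model with binary agree/disagree comparisons. It is not Splink's implementation: the function name `em_fellegi_sunter`, the simulated data and the "true" parameters are all assumptions made for illustration.

```python
import random


def em_fellegi_sunter(vectors, n_iter=30):
    """Estimate lam, m and u from binary comparison vectors (1 = agree, 0 = disagree)."""
    n_cols = len(vectors[0])
    lam = 0.1            # starting guess: proportion of comparisons that are matches
    m = [0.9] * n_cols   # starting guess: P(agree | match)
    u = [0.1] * n_cols   # starting guess: P(agree | non-match)
    for _ in range(n_iter):
        # E-step: probability that each comparison is a match under current parameters
        match_probs = []
        for vec in vectors:
            p_match, p_non = lam, 1 - lam
            for agrees, m_k, u_k in zip(vec, m, u):
                p_match *= m_k if agrees else 1 - m_k
                p_non *= u_k if agrees else 1 - u_k
            match_probs.append(p_match / (p_match + p_non))
        # M-step: re-estimate parameters as probability-weighted proportions
        total_match = sum(match_probs)
        total_non = len(vectors) - total_match
        lam = total_match / len(vectors)
        for k in range(n_cols):
            m[k] = sum(p for p, v in zip(match_probs, vectors) if v[k]) / total_match
            u[k] = sum(1 - p for p, v in zip(match_probs, vectors) if v[k]) / total_non
    return lam, m, u


# Simulate comparison vectors for three columns using hypothetical parameters
random.seed(42)
true_lam, true_m, true_u = 0.1, [0.9, 0.95, 0.85], [0.05, 0.1, 0.2]
vectors = []
for _ in range(5000):
    is_match = random.random() < true_lam
    agree_probs = true_m if is_match else true_u
    vectors.append(tuple(int(random.random() < p) for p in agree_probs))

lam, m, u = em_fellegi_sunter(vectors)
print(f"estimated proportion of matches: {lam:.3f}")
print("estimated m:", [round(x, 3) for x in m])
print("estimated u:", [round(x, 3) for x in u])
```

No labels are passed to the function: the parameters are recovered purely from the pattern of agreements and disagreements, which is why the choice of comparisons matters so much.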

Two things are particularly relevant when optimising a Splink job:

- Feature engineering:  how to transform our data to help Splink learn as much as possible.
- Avoiding confusion:  how to avoid misleading Splink with bad data or configuration (overfitting, converging to local rather than global maxima, etc.).

A second ingredient to help structure our thoughts is a better understanding of what Splink is trying to learn.

In a nutshell, Splink looks to exploit differences in match probability between subsets of record comparisons.  The greater the differences, the more accurate the matching.

This means that the task of the user is generally to try and make these groups as different as possible (though without overfitting).

These concepts go a long way to explaining most of the optimisations in this notebook.

### Our example data

In the remainder of this notebook we will build up an example settings dictionary, based on an example dataset like this which we want to dedupe:


| first_name | surname | initials | gender | dob        | postcode | office           |
|------------|---------|----------|--------|------------|----------|------------------|
| robin      | linacre | rl       | M      | 2000-01-01 | TA12 9PD | Bristol          |
| john       | smith   | js       | NULL   | 1955-02-03 | BA8      | Manchester       |



    

## Step 1. Decide on a list of comparison columns

The user must decide which columns to include in the list of `comparison_columns`.

The comparison columns are used to subset the data into different groups. For instance, the subset of record comparisons where first name matches is likely to contain a greater proportion of matches than the subset of record comparisons where first name does not match.
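As a toy illustration of this subsetting, the sketch below computes the proportion of true matches within each group. The labelled comparisons are invented for the example; in a real Splink job no such labels are available, and these proportions are what the model effectively has to infer.

```python
# Hypothetical labelled record comparisons: (first_name_agrees, is_true_match)
comparisons = [
    (True, True), (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
    (False, False), (False, False),
]


def match_rate(rows):
    """Proportion of the given record comparisons that are true matches."""
    return sum(is_match for _, is_match in rows) / len(rows)


agree = [row for row in comparisons if row[0]]
disagree = [row for row in comparisons if not row[0]]

print(f"all comparisons:      {match_rate(comparisons):.2f}")
print(f"first name agrees:    {match_rate(agree):.2f}")
print(f"first name disagrees: {match_rate(disagree):.2f}")
```

The wider the gap between the two subsets, the more information the column contributes.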

It is usually best to include any columns containing information that may help us accept or reject a match.

**⚠️ POTENTIAL TRAP ⚠️ :**  Do not include columns that repeat information in other columns because this will then be double counted. In this case, the person's initials should not be included as a separate comparison column.   

Where columns are highly correlated, consider including only one, since the Fellegi-Sunter model assumes independence.  Violation of this assumption can often result in some degree of double counting.  In this case, office is likely to be highly correlated with postcode, so we would advise against including it.
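The effect of double counting can be sketched numerically. Under the independence assumption, the evidence from each column is multiplied together, so a column that merely repeats information inflates the result. The prior and Bayes factors below are hypothetical values chosen for illustration, and `posterior_match_probability` is an illustrative helper, not part of Splink.

```python
import math


def posterior_match_probability(prior, bayes_factors):
    """Combine a prior with per-column Bayes factors (m / u), assuming independence."""
    odds = prior / (1 - prior)
    odds *= math.prod(bayes_factors)
    return odds / (1 + odds)


prior = 0.001        # hypothetical prior probability that a comparison is a match
surname_bf = 100.0   # hypothetical Bayes factor for a surname match
initials_bf = 20.0   # hypothetical Bayes factor for an initials match

without_initials = posterior_match_probability(prior, [surname_bf])
with_initials = posterior_match_probability(prior, [surname_bf, initials_bf])

print(f"surname only:       {without_initials:.3f}")
print(f"surname + initials: {with_initials:.3f}")
```

Because matching surnames largely guarantee matching initials, much of the second factor is not new evidence, yet the model treats it as if it were.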


At this stage, we have the following settings:

```python

settings = {
    "link_type": "dedupe_only",
    "comparison_columns": [
        {
            "col_name": "first_name"
        },
        {
            "col_name": "surname"
        },
        {
            "col_name": "gender"
        },
        {
            "col_name": "dob"
        },
        {
            "col_name": "postcode"
        }
    ]
}
```







## Step 2:  Choose the number of levels for each comparison column

Next, the user must choose the number of levels for each element of the `comparison_columns` list.  If omitted, it defaults to 2.

Consider specifically the `first_name` and `gender` comparison columns:

```python

settings = {
    ...
    "comparison_columns": [
        {
            "col_name": "first_name",
            "num_levels": 3
        },
        ...
        {
            "col_name": "gender",
            "num_levels": 2
        },
        ...
    ]
}
```

What does `num_levels` mean and why may we want three for the name columns, but two for gender?

We are seeking to exploit differences in the distribution of matches between different subsets.

For a first_name field, it's reasonable to assume there may be different match rates among three groups:
- The group of record comparisons where first name matches exactly.
- The group of record comparisons where first names are similar but not exactly the same
- The group of record comparisons where first names are not similar.

`"num_levels": 3` creates these groups, enabling Splink to estimate different match weights for each group.
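A rough sketch of how a comparison might be assigned to one of these three groups is shown below. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, not the function Splink uses, and both the function name `first_name_level` and the threshold of 0.75 are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def first_name_level(name_l, name_r, threshold=0.75):
    """Return 2 for an exact match, 1 for similar names, 0 otherwise, -1 if either is missing."""
    if name_l is None or name_r is None:
        return -1
    if name_l == name_r:
        return 2
    # Stand-in similarity score between 0 and 1
    similarity = SequenceMatcher(None, name_l, name_r).ratio()
    return 1 if similarity >= threshold else 0


for left, right in [("robin", "robin"), ("robin", "robyn"), ("robin", "john"), ("robin", None)]:
    print(left, right, first_name_level(left, right))
```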

Another way to understand why three levels is important is to consider the cost of using only two levels:
- The group of record comparisons where first name matches exactly.
- The group of record comparisons where first name does not match exactly.

The latter group contains both record comparisons where the names almost match and ones where the names do not match at all.  The algorithm has to 'take an average', which results in 'almost matches' being scored down too harshly and 'does not match at all' comparisons not being scored down enough.
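This averaging effect can be shown with a small calculation. In the Fellegi-Sunter framework the match weight of a level is log2(m / u). The m and u probabilities below are hypothetical; merging the two lower levels simply adds their probabilities together.

```python
import math

# Hypothetical m and u probabilities for a three-level first_name comparison,
# ordered [not similar, similar, exact match]
m = [0.1, 0.2, 0.7]
u = [0.7, 0.2, 0.1]


def match_weight(m_prob, u_prob):
    """Match weight of a level: log2 of the Bayes factor m / u."""
    return math.log2(m_prob / u_prob)


three_level = [match_weight(m_k, u_k) for m_k, u_k in zip(m, u)]

# With only two levels, 'not similar' and 'similar' collapse into 'not exact'
two_level_not_exact = match_weight(m[0] + m[1], u[0] + u[1])

print("three levels:", [round(w, 2) for w in three_level])
print("two levels, not exact:", round(two_level_not_exact, 2))
```

The merged weight falls between the two separate weights: similar names are penalised more than they would be with three levels, and dissimilar names are penalised less.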

For the gender column, which contains 'M', 'F', or null, only a two-level comparison is reasonable: it either matches or it doesn't.  

### Tradeoffs

Taken to its logical extreme, we could use a very large number of levels to account for subtle differences in distributions between groups of records with slightly different characteristics (e.g. one group for each possible value of edit distance).

There are two downsides to increasing the number of levels:
- Overfitting.  The more parameters you have, the more likely estimates are to be influenced by specific characteristics of your training data.  With many levels, some groups will be very small, which can lead to extreme parameter estimates.
- Computational complexity.  More levels means longer compute times.

Empirically, we have found that 3 or 4 levels is generally suitable for columns where string comparison functions are being used.
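The overfitting risk from very small groups can be sketched as follows. The counts are hypothetical and, for simplicity, the totals are held fixed; the point is only how sensitive a level's match weight is to a single extra observation.

```python
import math


def level_weight(m_count, m_total, u_count, u_total):
    """Match weight of a level estimated from counts of matches and non-matches."""
    return math.log2((m_count / m_total) / (u_count / u_total))


# A sparsely populated level: one extra matching pair doubles the estimate of m
small_shift = level_weight(2, 1000, 1, 10000) - level_weight(1, 1000, 1, 10000)

# A well populated level: one extra matching pair barely changes the estimate
large_shift = level_weight(501, 1000, 100, 10000) - level_weight(500, 1000, 100, 10000)

print(f"shift in match weight, small level: {small_shift:.3f}")
print(f"shift in match weight, large level: {large_shift:.3f}")
```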

## Step 3: Customising the comparison 


```python
settings = {
    ...
    "comparison_columns": [
        {
            "custom_name": "postcode_office_custom",
            "case_expression": custom_sql_case_expression_goes_here,
            "custom_columns_used": ["postcode", "office"],
            "num_levels": 3
        }
    ]
}
```
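As a sketch of what such a custom expression might look like, the example below defines a hypothetical three-level CASE expression over postcode and office and checks its logic against a few rows. The expression itself, the variable name `postcode_office_case_expression` and the test rows are assumptions made for illustration; SQLite is used here only as a convenient stand-in for checking the logic, whereas Splink runs the expression in Spark SQL.

```python
import sqlite3

# Hypothetical CASE expression combining postcode and office into three levels;
# -1 marks comparisons where either postcode is null
postcode_office_case_expression = """
CASE
    WHEN postcode_l IS NULL OR postcode_r IS NULL THEN -1
    WHEN postcode_l = postcode_r THEN 2
    WHEN office_l = office_r THEN 1
    ELSE 0
END
"""

# Invented record comparisons: (postcode_l, postcode_r, office_l, office_r)
rows = [
    ("TA12 9PD", "TA12 9PD", "Bristol", "Bristol"),
    ("TA12 9PD", "BA8", "Bristol", "Bristol"),
    ("TA12 9PD", "BA8", "Bristol", "Manchester"),
    (None, "BA8", "Bristol", "Manchester"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comparisons "
    "(postcode_l TEXT, postcode_r TEXT, office_l TEXT, office_r TEXT)"
)
conn.executemany("INSERT INTO comparisons VALUES (?, ?, ?, ?)", rows)

query = (
    f"SELECT {postcode_office_case_expression} AS level "
    "FROM comparisons ORDER BY rowid"
)
levels = [row[0] for row in conn.execute(query)]
print(levels)
```

Checking an expression against a handful of hand-written rows like this is a cheap way to catch typos before running a full job.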


A variety of string comparison functions are provided in a [jar](https://github.com/moj-analytical-services/splink/tree/master/jars) accompanying Splink. A code snippet showing how to load them into Spark can be found [here](https://github.com/moj-analytical-services/splink_demos/blob/master/code_snippets/loading_jar.py). 

The functions provided are:
- [**Jaro-Winkler similarity**](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance#Jaro%E2%80%93Winkler_Similarity).  A normalised similarity measure running from 0 (no match) to 1 (exact match). Places greater weight on earlier characters. Generally a good option for names and dictionary words.
- [**Jaccard similarity**](https://en.wikipedia.org/wiki/Jaccard_index).  A token-based similarity measure that considers the number of common tokens.  May be appropriate for some unique identifiers, such as a driving licence number, that may be entered with errors.
- [**Cosine distance**](https://en.wikipedia.org/wiki/Cosine_similarity).  A measure of similarity between different substrings.  Particularly appropriate for longer text strings that cannot be parsed out into separate fields.

Splink defaults to using [Jaro-Winkler similarity](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance#Jaro%E2%80%93Winkler_Similarity) for string comparisons, which is generally a good choice for names or dictionary words.
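To illustrate how a token-based measure behaves, here is a minimal sketch of Jaccard similarity using character bigrams as tokens. The choice of bigrams and the function names are assumptions made for illustration; this is not the implementation provided in the jar.

```python
def bigrams(text):
    """Set of overlapping two-character tokens in the text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard_similarity(left, right):
    """Size of the shared token set divided by the size of the combined token set."""
    tokens_l, tokens_r = bigrams(left), bigrams(right)
    if not tokens_l and not tokens_r:
        # Strings too short to tokenise: fall back to exact equality
        return 1.0 if left == right else 0.0
    return len(tokens_l & tokens_r) / len(tokens_l | tokens_r)


print(jaccard_similarity("linacre", "linacre"))
print(jaccard_similarity("linacre", "linacer"))
print(jaccard_similarity("smith", "jones"))
```

Unlike Jaro-Winkler, this measure gives no extra weight to the start of the string, which is one reason it may suit identifiers better than names.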







## Multicollinearity

## Identifying problems

## Use of multiple blocking rules

Try and ensure a large number of comparisons


```python

settings = {
    ...
    "comparison_columns": [
        {
            "col_name": "first_name",
            "num_levels": 3,
            "u_probabilities": [
                0.7,
                0.2,
                0.1
            ],
            "m_probabilities": [
                0.1,
                0.2,
                0.7
            ]
        },
        ...
    ]
}
```

## Number of levels 

## Blocking rules



# Other considerations

## Tradeoffs

## Understanding splink defaults

Demo of complete settings dictionary function 

## Term frequency adjustments 

```python
# Custom SQL CASE expression that assigns each comparison to a level;
# -1 marks comparisons where either value is null
custom_expression = """
CASE
WHEN first_name_l is null or first_name_r is null THEN -1
WHEN first_name_l = first_name_r THEN 2
WHEN dmetaphone(first_name_l) = dmetaphone(first_name_r) THEN 1
ELSE 0
END
"""

settings = {
    "comparison_columns": [
        {
            "col_name": "first_name",
            "num_levels": 3,
            "case_expression": custom_expression
        }
    ]
}
```