# Customising splink - configuration and settings

## Summary

In the [Quickstart demo](quickstart_demo.ipynb) we saw an example of how to use `splink`.  There was minimal customisation of the settings - the demo relies on the fact that, when settings are not specified by the user, `splink` uses sensible default values.

Not all settings have defaults - at minimum the user needs to choose the `link_type`, the `blocking_rules` and the `comparison_columns`.

In most real-world applications, more accurate results will be obtained by customising the settings, often by trial and error. 

The main way to do this is through the `settings` dictionary.  This is passed to `splink` like so:


```
from splink import Splink

settings = { }  # Settings dictionary goes here

linker = Splink(settings, spark, df=df)
```

You can view the 'sensible default' values that have been chosen for the remaining settings as follows:

```
from splink.settings import complete_settings_dict
complete_settings_dict(settings)
```

This settings dictionary can be quite complicated, so this notebook provides details of the various settings and what they do.  

We recommend using it alongside our [autocompleting settings editor](https://robinlinacre.com/simple_sparklink_settings_editor/), which makes it quicker and easier to write settings dictionaries.  The editor validates your settings against the [json schema](https://github.com/moj-analytical-services/sparklink/blob/dev/sparklink/files/settings_jsonschema.json) for the settings.  Note: you can use this schema in some text editors to enable autocompletion - e.g. see [here](https://code.visualstudio.com/docs/languages/json#_intellisense-and-validation) for VS Code.



You can also validate a settings object within Python with the following code:

```
from splink.validate import validate_settings
validate_settings(settings)
```
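As a rough illustration of the most basic check involved, the sketch below looks for the three keys that have no defaults (`link_type`, `blocking_rules` and `comparison_columns`). The helper `missing_required_keys` is hypothetical and is not part of `splink`; it does not reproduce what `validate_settings` does, which validates against the full json schema.

```python
# Hypothetical helper, NOT part of splink: it only checks for the three
# top-level keys that this notebook says have no default values.
REQUIRED_KEYS = ("link_type", "blocking_rules", "comparison_columns")


def missing_required_keys(settings):
    """Return the required top-level keys that are absent from `settings`."""
    return [key for key in REQUIRED_KEYS if key not in settings]


incomplete_settings = {"link_type": "dedupe_only"}
print(missing_required_keys(incomplete_settings))
```

An empty list means all three keys are present; it says nothing about whether their values are valid.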



## Choosing the link type

There are three types of data linking or deduplication built into `splink`, which are configured by setting the `link_type` key of the settings dictionary.

In [1]:
from demo_notebooks.demo_utils import render_key_as_markdown
render_key_as_markdown("link_type")

**Summary**:
The type of data linking task - `dedupe_only`, `link_only` or `link_and_dedupe`.  Required.

**Description**:
- When `dedupe_only`, the user provides a single input dataframe, and `splink` tries to find duplicate entries
- When `link_only`, the user provides two dataframes, and `splink` tries to find links between the two.  It makes no attempt to deduplicate the datasets, so this is best used when the input datasets contain no duplicates
- When `link_and_dedupe`, the user provides two dataframes, and `splink` simultaneously links and dedupes the dataframes.

**Data type**: string

**Possible values**: `dedupe_only`, `link_only`, `link_and_dedupe`

**Example**:

```
settings = {
    "link_type": "dedupe_only"
}
```
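The relationship between `link_type` and the number of input dataframes described above can be sketched as a small lookup. The function `check_link_type` is a hypothetical illustration, not a `splink` function.

```python
# Hypothetical illustration, NOT part of splink: the number of input
# dataframes each link type expects, as described in the text above.
EXPECTED_INPUTS = {
    "dedupe_only": 1,       # one dataframe, find duplicates within it
    "link_only": 2,         # two dataframes, link across them only
    "link_and_dedupe": 2,   # two dataframes, link and dedupe together
}


def check_link_type(link_type, num_dataframes):
    """Raise ValueError if the link type is unknown or the input count is wrong."""
    if link_type not in EXPECTED_INPUTS:
        raise ValueError(f"Unknown link_type: {link_type!r}")
    expected = EXPECTED_INPUTS[link_type]
    if num_dataframes != expected:
        raise ValueError(
            f"{link_type} expects {expected} dataframe(s), got {num_dataframes}"
        )
    return True


check_link_type("dedupe_only", 1)
```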

## Choosing your blocking rules (or cartesian join)

In most linking tasks, you will need to choose one or more [blocking rules](https://www.isi.edu/integration/papers/michelson06-aaai.pdf), which are used as a pre-processing step to eliminate implausible matches.  Without blocking, Apache Spark will compare all records to one another, which is usually computationally intractable for linking problems of more than about 10,000-50,000 records.

These are specified using the `blocking_rules` key of the `settings` dictionary.

Alternatively, if you really do want to compare all records to one another, you can set the `cartesian_product` key instead.

### Blocking rules

In [2]:
from demo_notebooks.demo_utils import render_key_as_markdown
render_key_as_markdown("blocking_rules")

**Summary**:
A list of one or more blocking rules to apply. Ignored if `cartesian_product` is set to true.

**Description**:
Each rule is a SQL expression representing the blocking rule, which will be used to create a join.  The left table is aliased with `l` and the right table is aliased with `r`. For example, if you want to block on a `first_name` column, the blocking rule would be `l.first_name = r.first_name`.  Note that splink deduplicates the comparisons generated by the blocking rules.

**Data type**: array

**Example**:

```
settings = {
    "blocking_rules": ['l.first_name = r.first_name AND l.surname = r.surname', 'l.dob = r.dob']
}
```
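To make the effect of blocking concrete, here is a minimal pure-Python sketch that mimics the two rules in the example above on a toy dataset. It is not how `splink` generates comparisons (that happens as SQL joins in Spark); the records and the lambda-based rules are invented for illustration. A pair that satisfies more than one rule is kept only once, mirroring the deduplication of comparisons mentioned in the description.

```python
from itertools import combinations

# Toy records, invented for illustration
records = [
    {"id": 1, "first_name": "john", "surname": "smith", "dob": "1990-01-01"},
    {"id": 2, "first_name": "john", "surname": "smith", "dob": "1990-01-01"},
    {"id": 3, "first_name": "jon", "surname": "smith", "dob": "1990-01-01"},
    {"id": 4, "first_name": "mary", "surname": "jones", "dob": "1985-06-30"},
]

# Python stand-ins for the two SQL blocking rules in the example above
rules = [
    lambda l, r: l["first_name"] == r["first_name"] and l["surname"] == r["surname"],
    lambda l, r: l["dob"] == r["dob"],
]

# Every unordered pair of records (what you would get with no blocking)
all_pairs = list(combinations(records, 2))

# A set keeps each pair once, even if it satisfies several rules
blocked_pairs = {
    (l["id"], r["id"]) for l, r in all_pairs if any(rule(l, r) for rule in rules)
}

print(len(all_pairs), "pairs without blocking")
print(sorted(blocked_pairs), "pairs retained by the blocking rules")
```

Here the pair (1, 2) satisfies both rules but appears only once in the result, and record 4 is never compared with anything.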

### Cartesian join

In [3]:
render_key_as_markdown("cartesian_product")

**Summary**:
If set to true, all comparisons between the input dataset(s) will be generated and blocking will not be used.

**Description**:
For large input datasets, this will generally be computationally intractable because the number of comparisons generated grows with the square of the number of rows.

**Data type**: boolean

**Default value if not provided**: False

**Example**:

```
settings = {
    "cartesian_product": False
}
```
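The quadratic growth can be checked with some quick arithmetic. The sketch below assumes that, when deduplicating a single dataset, each unordered pair of rows is compared once, giving n(n-1)/2 comparisons, and that linking two datasets compares every row on the left with every row on the right. Whether `splink` counts comparisons in exactly this way is an assumption; the function name is hypothetical.

```python
def cartesian_comparisons(n_left, n_right=None):
    """Number of record comparisons generated without blocking.

    Assumption: deduplicating one dataset compares each unordered pair once;
    linking two datasets compares every left row with every right row.
    """
    if n_right is None:
        return n_left * (n_left - 1) // 2
    return n_left * n_right


for n in (10_000, 50_000, 1_000_000):
    print(f"{n:>9,} rows -> {cartesian_comparisons(n):,} comparisons")
```

A tenfold increase in rows gives roughly a hundredfold increase in comparisons, which is why blocking is needed beyond small datasets.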

## Choosing which columns should be used to link data

The user must decide which columns in the input datasets will be used to link data.  

For example, if the user is deduplicating a table of people, it would be typical to use personal identifiers like name and date of birth.

These columns, and configuration options for each individual column, are provided as a list of dictionaries assigned to the `comparison_columns` key of the `settings` dictionary.

At a minimum, each entry in the list must have `col_name` populated.  All other settings have sensible defaults.

For example:
```
settings = {
    "comparison_columns": [
        {
            "col_name": "first_name"
        },
        {
            "col_name": "latitude",
            "data_type": "numeric"
        }
    ]
}
```

**Note that, throughout the splink package, it is assumed that incoming datasets have been prepared so they have common column names - e.g. if you are linking two datasets on first name, the column containing this data has the same name in both datasets.  This cannot be configured and the package will not work without this preparation step.**

## Choosing a prior for proportion of matches

`splink` is more likely to iterate towards good parameter values if the starting values are good guesses.  An important setting is `proportion_of_matches`, which is the starting value (prior belief) for the proportion of comparisons which are matches.  

In [14]:
render_key_as_markdown("proportion_of_matches")

**Summary**:
The proportion of comparisons thought to be matches

**Description**:
This provides the initial value (prior) that the EM algorithm will start iterating from

**Data type**: number

**Default value if not provided**: 0.3

**Example**:

```
settings = {
    "proportion_of_matches": 0.3
}
```
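To see why the starting value matters, the sketch below applies Bayes' rule to a single comparison, combining the prior with two illustrative probabilities: `m` (the chance of observing this agreement pattern among true matches) and `u` (the chance of observing it among non-matches). The values of `m` and `u` are made up, and the function is a hypothetical illustration of the general idea rather than `splink`'s implementation.

```python
def match_probability(proportion_of_matches, m, u):
    """Posterior probability that a comparison is a match, by Bayes' rule.

    proportion_of_matches: prior probability that a comparison is a match
    m: P(observed agreement pattern | match)      (illustrative value)
    u: P(observed agreement pattern | non-match)  (illustrative value)
    """
    numerator = proportion_of_matches * m
    denominator = numerator + (1 - proportion_of_matches) * u
    return numerator / denominator


# Same evidence (m=0.9, u=0.1), different priors
for prior in (0.3, 0.01):
    print(prior, round(match_probability(prior, m=0.9, u=0.1), 4))
```

With identical evidence, a prior of 0.3 gives a match probability of about 0.79, whereas a prior of 0.01 gives about 0.08, so the first iteration starts from a very different place.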

## Configuring individual columns

Each entry in the `comparison_columns` list is a dictionary, which enables the user to provide additional customisation options for each individual comparison.

### Number of levels 

In [3]:
render_key_as_markdown("num_levels", True)

**Summary**:
The number of different similarity levels that will be computed for this column

**Description**:
A greater value for `num_levels` means the algorithm can be more precise about how string similarity is treated - e.g. by making a distinction between strings which are an almost-exact match, strings which are quite similar, and strings which don't really match at all.  However, more levels result in longer compute times and can sometimes affect convergence. By default, for a string variable, two levels implies level 0: no match, level 1: almost exact match.  Three levels implies level 0: no match, level 1: strings are similar but not exactly the same, level 2: strings are almost exactly the same.

**Data type**: integer

**Default value if not provided**: 2

**Example**:

```
settings = {
    "comparison_columns": [
        {
            "num_levels": 2
        }
    ]
}
```
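The idea of similarity levels can be sketched as bucketing a similarity score using a list of thresholds, where `num_levels` is one more than the number of thresholds. The thresholds below are arbitrary, and `difflib.SequenceMatcher` from the Python standard library is used purely as a stand-in for a string similarity measure; neither is what `splink` uses internally.

```python
from difflib import SequenceMatcher


def similarity_to_level(score, thresholds):
    """Map a similarity score in [0, 1] to an integer level.

    `thresholds` is an ascending list of cut-offs (arbitrary, for illustration).
    The level is the number of thresholds the score meets, so there are
    len(thresholds) + 1 possible levels: 0, 1, ..., len(thresholds).
    """
    return sum(score >= threshold for threshold in thresholds)


# Three levels: 0 = no match, 1 = similar, 2 = almost exactly the same
three_level_thresholds = [0.7, 0.95]

for left, right in [("robin", "robin"), ("robin", "robyn"), ("robin", "susan")]:
    score = SequenceMatcher(None, left, right).ratio()  # stand-in similarity
    print(left, right, similarity_to_level(score, three_level_thresholds))
```

Adding a threshold adds a level, which lets the model treat "similar" and "almost identical" strings differently at the cost of more parameters to estimate.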

### Data type

In [4]:
from demo_notebooks.demo_utils import render_key_as_markdown
render_key_as_markdown("data_type", True)

**Summary**:
The data type of the column.  This is used to choose how similarity is assessed for this column. This is ignored if you explicitly provide a `case_expression`.

**Description**:
- If `string` is specified, `splink` will use the Jaro-Winkler string comparison function.
- If `numeric` is specified, then similarity will be assessed based on the absolute percentage difference between the two values.

**Data type**: string

**Possible values**: `string`, `numeric`

**Default value if not provided**: string

**Example**:

```
settings = {
    "comparison_columns": [
        {
            "data_type": "string"
        }
    ]
}
```
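For `numeric` columns, the notion of an absolute percentage difference can be sketched as follows. The choice of denominator (the larger of the two absolute values) and the 1% cut-off are assumptions made for this illustration and may differ from what `splink` does; both function names are hypothetical.

```python
def abs_percentage_difference(left, right):
    """Absolute difference as a fraction of the larger absolute value.

    Assumption: the denominator is max(|left|, |right|); two zeros are
    treated as identical.
    """
    largest = max(abs(left), abs(right))
    if largest == 0:
        return 0.0
    return abs(left - right) / largest


def numeric_level(left, right, max_difference=0.01):
    """Two-level comparison: 1 if the values are within `max_difference`, else 0.

    The 1% default is an arbitrary cut-off chosen for illustration.
    """
    return 1 if abs_percentage_difference(left, right) <= max_difference else 0


print(numeric_level(100, 101))  # difference of 1/101, under 1%
print(numeric_level(100, 150))  # difference of 50/150, well over 1%
```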

### Term frequency adjustments


In [10]:
render_key_as_markdown("term_frequency_adjustments", True)


**Description**:
Whether ex post term frequency adjustments should be made to match scores for this column

**Data type**: boolean

**Default value if not provided**: False

### Case expression

In [11]:
render_key_as_markdown("case_expression", True)

**Summary**:
A SELECT CASE expression that compares the values of the input column and returns integer values corresponding to num_levels. 

**Description**:
This is an override which allows the user to customise how similarity is computed for this column.  If given, this overrides the default mechanism of comparing columns and ignores `data_type`.

**Data type**: string

**Example**:

```
settings = {
    "comparison_columns": [
        {
            "case_expression": "CASE WHEN first_name_l = first_name_r THEN 1 ELSE 0 END"
        }
    ]
}
```
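It can be useful to try out a case expression on a few hand-written rows before using it. The sketch below runs the expression from the example above against an in-memory SQLite table using Python's built-in `sqlite3` module. SQLite is only a convenient stand-in here: `splink` evaluates the expression in Spark SQL, and functions available in one engine may not exist in the other. The table name `comparisons` and the alias `level` are invented for this example.

```python
import sqlite3

case_expression = "CASE WHEN first_name_l = first_name_r THEN 1 ELSE 0 END"

# In-memory SQLite table standing in for a table of record comparisons
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comparisons (first_name_l TEXT, first_name_r TEXT)")
conn.executemany(
    "INSERT INTO comparisons VALUES (?, ?)",
    [("john", "john"), ("john", "jon"), ("mary", None)],
)

# Evaluate the case expression for each row, in insertion order
query = f"SELECT {case_expression} AS level FROM comparisons ORDER BY rowid"
levels = [row[0] for row in conn.execute(query)]
conn.close()

print(levels)
```

Note that the row with a missing value falls through to the `ELSE` branch, because a comparison with `NULL` is never true in SQL; an expression that should treat nulls differently needs an explicit `WHEN` clause for them.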

## Example of a full settings dictionary

In [None]:
settings = {
    "link_type": "dedupe_only",
    "proportion_of_matches": 0.6,
    "blocking_rules": [
        "l.first_name = r.first_name AND l.surname = r.surname",
        "l.date_of_birth = r.date_of_birth AND l.surname = r.surname"
    ],
    "comparison_columns": [
        {
            "col_name": "first_name",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "surname",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "date_of_birth",
            "num_levels": 3,
            "case_expression": "CASE WHEN date_of_birth_l = date_of_birth_r THEN 2 WHEN year(date_of_birth_l) = year(date_of_birth_r) THEN 1 ELSE 0 END"
        },
        {
            "col_name": "income_gbp",
            "data_type": "numeric"
        }
    ]
}