# Customising splink - configuration and settings

## Summary

In the [Quickstart demo](quickstart_demo.ipynb) we saw an example of how to use `splink`. That demo used minimal customisation: when settings are not specified by the user, `splink` falls back to sensible default values.

Not all settings have defaults - at minimum the user needs to choose the `link_type`, the `blocking_rules` and the `comparison_columns`.

In most real-world applications, more accurate results will be obtained by customising the settings, often by trial and error. 

The main way to do this is through the `settings` dictionary.  This is passed to `splink` like so:


```
from splink import Splink

settings = { }  # Settings dictionary goes here

linker = Splink(settings, spark, df=df)
```

This settings dictionary can be quite complicated, so this notebook provides details of the various settings and what they do.  

We recommend using it alongside our [autocompleting settings editor](https://robinlinacre.com/simple_sparklink_settings_editor/), which makes it quicker and easier to write settings dictionaries. The editor validates your settings against the [json schema](https://github.com/moj-analytical-services/sparklink/blob/dev/sparklink/files/settings_jsonschema.json) for the settings. Note: you can also use this schema in some text editors to enable autocompletion - e.g. see [here](https://code.visualstudio.com/docs/languages/json#_intellisense-and-validation) for VS Code.

You can also validate a settings object within Python with the following code:

```
from splink.validate import validate_settings
validate_settings(settings)
```

## Choosing the link type

There are three types of data linking or deduplication built into `splink`, which are configured by setting the `link_type` key of the settings dictionary.

```
from demo_notebooks.demo_utils import render_key_as_markdown
render_key_as_markdown("link_type")
```

**Summary**:
The type of data linking task - `dedupe_only`, `link_only` or `link_and_dedupe`.  Required.

**Description**:
- When `dedupe_only`, the user provides a single input dataframe, and `splink` tries to find duplicate entries
- When `link_only`, the user provides two dataframes, and `splink` tries to find links between the two. It makes no attempt to deduplicate either dataset, so this is best used when the input datasets contain no duplicates
- When `link_and_dedupe`, the user provides two dataframes, and `splink` simultaneously links and dedupes the dataframes.

**Data type**: string

**Possible values**: `dedupe_only`, `link_only`, `link_and_dedupe`

**Example**:

```
settings = {
    "link_type": "dedupe_only"
}
```
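As a rough illustration of how the three link types differ in the inputs they expect, the sketch below maps each `link_type` to the number of input dataframes described above. The names `EXPECTED_INPUTS` and `expected_input_count` are hypothetical and are not part of `splink`.

```python
# Hypothetical helper, not part of splink: maps each link type to the number
# of input dataframes described in the text above.
EXPECTED_INPUTS = {
    "dedupe_only": 1,
    "link_only": 2,
    "link_and_dedupe": 2,
}


def expected_input_count(settings):
    """Return how many input dataframes the chosen link_type expects."""
    link_type = settings.get("link_type")
    if link_type not in EXPECTED_INPUTS:
        raise ValueError(
            f"link_type must be one of {sorted(EXPECTED_INPUTS)}, got {link_type!r}"
        )
    return EXPECTED_INPUTS[link_type]


settings = {"link_type": "dedupe_only"}
print(expected_input_count(settings))
```

A check like this can catch a misspelt or missing `link_type` before any Spark job is started.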

## Choosing your blocking rules (or cartesian join)

In most linking tasks, you will need to choose one or more [blocking rules](https://www.isi.edu/integration/papers/michelson06-aaai.pdf), which are used as a pre-processing step to eliminate implausible matches. Without blocking, Apache Spark will compare all records to one another, which is usually computationally intractable for linking problems larger than roughly 10,000-50,000 records.

These are specified using the `blocking_rules` key of the `settings` dictionary.

Alternatively, if you really do want to compare all records to one another, you can set the `cartesian_product` key instead.

### Blocking rules

```
from demo_notebooks.demo_utils import render_key_as_markdown
render_key_as_markdown("blocking_rules")
```

**Summary**:
A list of one or more blocking rules to apply. Ignored if `cartesian_product` is set to true

**Description**:
Each rule is a SQL expression representing the blocking rule, which will be used to create a join.  The left table is aliased with `l` and the right table is aliased with `r`. For example, if you want to block on a `first_name` column, the blocking rule would be `l.first_name = r.first_name`

**Data type**: array

**Example**:

```
settings = {
    "blocking_rules": ['l.first_name = r.first_name AND l.surname = r.surname', 'l.dob = r.dob']
}
```
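To make the `l` and `r` aliases concrete, here is a minimal sketch of how blocking rules like those above could be turned into a join. It uses Python's built-in `sqlite3` rather than Spark, and it assumes that a pair is kept when it satisfies at least one rule. The table name `df`, the `unique_id` column, the toy data and the `candidate_pairs` function are all illustrative assumptions; the SQL that `splink` actually generates may differ.

```python
import sqlite3

# Toy data; the linking column names follow the examples in this notebook.
rows = [
    (1, "john", "smith", "1980-01-01"),
    (2, "john", "smith", "1980-01-02"),
    (3, "jon", "smyth", "1980-01-01"),
    (4, "mary", "jones", "1975-06-30"),
]

blocking_rules = [
    "l.first_name = r.first_name AND l.surname = r.surname",
    "l.dob = r.dob",
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE df (unique_id INTEGER, first_name TEXT, surname TEXT, dob TEXT)"
)
conn.executemany("INSERT INTO df VALUES (?, ?, ?, ?)", rows)


def candidate_pairs(conn, rules):
    """Self-join df, keeping pairs that satisfy at least one blocking rule."""
    where = " OR ".join(f"({rule})" for rule in rules)
    sql = f"""
        SELECT l.unique_id, r.unique_id
        FROM df AS l
        INNER JOIN df AS r ON l.unique_id < r.unique_id
        WHERE {where}
        ORDER BY l.unique_id, r.unique_id
    """
    return conn.execute(sql).fetchall()


pairs = candidate_pairs(conn, blocking_rules)
print(pairs)  # [(1, 2), (1, 3)]
```

Of the six possible pairs among four records, only two survive: records 1 and 2 share a first name and surname, and records 1 and 3 share a date of birth. The condition `l.unique_id < r.unique_id` stops a record being compared with itself and stops each pair appearing twice.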

### Cartesian join

```
render_key_as_markdown("cartesian_product")
```

**Summary**:
If set to true, all comparisons between the input datasets will be generated

**Description**:
For large input datasets, this will generally be computationally intractable because the number of comparisons generated grows with the square of the number of rows

**Data type**: boolean

**Example**:

```
settings = {
    "cartesian_product": False
}
```
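The quadratic growth is easy to see with a little arithmetic. The sketch below assumes the deduplication case, where each distinct pair of records in a single table is compared once, giving n(n-1)/2 comparisons; the function name `dedupe_comparisons` is illustrative and not part of `splink`.

```python
def dedupe_comparisons(num_rows):
    """Number of distinct record pairs in a single table of num_rows records."""
    return num_rows * (num_rows - 1) // 2


for n in (10_000, 50_000, 1_000_000):
    print(f"{n:>9,} rows -> {dedupe_comparisons(n):>15,} comparisons")
```

At 50,000 records this is already about 1.25 billion comparisons, and at one million records it is roughly 500 billion, which is why blocking rules are usually preferred.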

## Choosing which columns should be used to link data

The user must decide which columns in the input datasets will be used to link data.  For example, if the user is deduplicating a table of people, it would be typical to use personal identifiers like name and date of birth.

These columns, and configuration options for each individual column, are provided as a list of dictionaries assigned to the `column_settings` key of the `settings` dictionary.

For example:

```
settings = {
    "column_settings": [
        {
            "column_name": "first_name"
        },
        {
            "column_name": "latitude",
            "data_type": "numeric"
        }
    ]
}
```

**Note that, throughout the splink package, it is assumed that incoming datasets have been prepared so they have common column names - e.g. if you are linking two datasets on first name, the column containing this data has the same name in both datasets.  This cannot be configured and the package will not work without this preparation step.**
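One way to catch a missing preparation step early is to compare the column names in the settings with the column names of each input dataset. The sketch below is a hypothetical helper, not part of `splink`; it works on plain lists of column names, and the example column lists are made up for illustration.

```python
def missing_columns(settings, *dataframe_columns):
    """For each list of column names, return the linking columns it lacks."""
    needed = [col["column_name"] for col in settings["column_settings"]]
    return [
        [name for name in needed if name not in columns]
        for columns in dataframe_columns
    ]


settings = {
    "column_settings": [
        {"column_name": "first_name"},
        {"column_name": "latitude", "data_type": "numeric"},
    ]
}

# Illustrative column lists; the second dataset still uses "forename".
left_columns = ["unique_id", "first_name", "latitude"]
right_columns = ["unique_id", "forename", "latitude"]

print(missing_columns(settings, left_columns, right_columns))
# [[], ['first_name']]
```

With Spark dataframes, the lists of names can be taken from each dataframe's `columns` attribute. Here the second dataset would need `forename` renamed to `first_name` before linking.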

