# Defining and customising how record comparisons are made

A key feature of Splink is the ability to customise how record comparisons are made, that is, how similarity is defined for different data types. For example, the definition of similarity that is appropriate for a date of birth field differs from the one that is appropriate for a first name field.

Tailoring these definitions lets a linking model distinguish between finer gradations of similarity, which leads to more accurate data linking.

For more detail, see [the article on Comparison and ComparisonLevels](./comparisons_and_comparison_levels.html).


## `Comparisons`, `ComparisonTemplates` and `ComparisonLevels`

A Splink model contains a collection of `Comparisons` and `ComparisonLevels` organised in a hierarchy.  An example is as follows:

```
Data Linking Model
├─-- Comparison: Date of birth
│    ├─-- ComparisonLevel: Exact match
│    ├─-- ComparisonLevel: Up to one character difference
│    ├─-- ComparisonLevel: Up to three character difference
│    ├─-- ComparisonLevel: All other
├─-- Comparison: Name
│    ├─-- ComparisonLevel: Exact match on first name and surname
│    ├─-- ComparisonLevel: Exact match on first name
│    ├─-- etc.
```

A fuller description of `Comparison`s and `ComparisonLevel`s can be found [here](https://moj-analytical-services.github.io/splink/comparisons_and_comparison_levels.html).
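
One way to picture this hierarchy: each `Comparison` is an ordered list of levels, and a pair of records is assigned to the first level whose condition holds. The sketch below is plain Python rather than Splink code; the `assign_level` helper, the `name_comparison` list and the record fields are hypothetical names used only to illustrate the idea.

```python
# Illustrative only: models a Comparison as an ordered list of
# (label, predicate) pairs. This is not Splink's implementation.
name_comparison = [
    (
        "Exact match on first name and surname",
        lambda l, r: l["first_name"] == r["first_name"]
        and l["surname"] == r["surname"],
    ),
    (
        "Exact match on first name",
        lambda l, r: l["first_name"] == r["first_name"],
    ),
    # Catch-all level, playing the role of ELSE
    ("All other", lambda l, r: True),
]


def assign_level(comparison, record_l, record_r):
    """Return the label of the first level whose predicate is satisfied."""
    for label, predicate in comparison:
        if predicate(record_l, record_r):
            return label
    raise ValueError("A comparison should end with a catch-all level")


left = {"first_name": "Ada", "surname": "Lovelace"}
right = {"first_name": "Ada", "surname": "Lovelase"}

# The surnames differ, so the first level fails and the second one applies
print(assign_level(name_comparison, left, right))  # Exact match on first name
```

Because the first satisfied level wins, the order of the levels matters: stricter conditions should come before looser ones.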


How are these comparisons specified?



### Three ways of specifying Comparisons

In Splink, there are three ways of specifying `Comparisons`:

- Using pre-baked comparisons from a backend's `ComparisonLibrary` or `ComparisonTemplateLibrary` (simplest and most succinct)
- Composing pre-defined `ComparisonLevels` from a backend's `ComparisonLevelLibrary`
- Writing a full spec of a `Comparison` by hand (most verbose and most flexible)

<hr>

## Method 1: Using the `ComparisonLibrary` or `ComparisonTemplateLibrary`

The `ComparisonLibrary` and `ComparisonTemplateLibrary` contain pre-baked similarity functions that cover many common use cases.

These functions generate an entire `Comparison`, composed of several `ComparisonLevels`.

The following example uses the `ExactMatch` `Comparison` and prints its description (with the associated SQL) for the `duckdb` backend:

```py
import splink.comparison_library as cl

first_name_comparison = cl.ExactMatch("first_name")
print(first_name_comparison.get_comparison("duckdb").human_readable_description)
```

```
Comparison 'Exact match 'first_name' vs. anything else' of "first_name".
Similarity is assessed using the following ComparisonLevels:
    - 'first_name is NULL' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
    - 'Exact match on first_name' with SQL rule: "first_name_l" = "first_name_r"
    - 'All other comparisons' with SQL rule: ELSE
```



Note that, under the hood, these functions generate a Python dictionary, which conforms to the underlying `.json` specification of a model:

```py
first_name_comparison.get_comparison("duckdb").as_dict()
```

```
{'output_column_name': 'first_name',
 'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
   'label_for_charts': 'first_name is NULL',
   'is_null_level': True},
  {'sql_condition': '"first_name_l" = "first_name_r"',
   'label_for_charts': 'Exact match on first_name'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': "Exact match 'first_name' vs. anything else"}
```
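
To see why the levels are expressed as SQL conditions ending in `ELSE`, it can help to assemble them into a single `CASE` expression. The following is a rough sketch, not Splink's own SQL generation: the `levels_to_case_statement` helper is hypothetical, and the `gamma_` prefix and the numbering scheme (null level as -1, `ELSE` as 0, other levels counting down from the top) are assumptions made for illustration.

```python
def levels_to_case_statement(comparison: dict) -> str:
    """Build an illustrative CASE expression from a comparison dictionary.

    Assumptions for this sketch: the ELSE level is listed last, the null
    level maps to -1, ELSE maps to 0, and the remaining levels are numbered
    downwards from the top of the list.
    """
    levels = comparison["comparison_levels"]
    scored = [
        lvl
        for lvl in levels
        if not lvl.get("is_null_level") and lvl["sql_condition"] != "ELSE"
    ]
    next_value = len(scored)
    lines = ["CASE"]
    for lvl in levels:
        if lvl["sql_condition"] == "ELSE":
            lines.append("  ELSE 0")
        elif lvl.get("is_null_level"):
            lines.append(f"  WHEN {lvl['sql_condition']} THEN -1")
        else:
            lines.append(f"  WHEN {lvl['sql_condition']} THEN {next_value}")
            next_value -= 1
    lines.append(f"END AS gamma_{comparison['output_column_name']}")
    return "\n".join(lines)


# Same structure as the dictionary shown above
first_name_dict = {
    "output_column_name": "first_name",
    "comparison_levels": [
        {
            "sql_condition": '"first_name_l" IS NULL OR "first_name_r" IS NULL',
            "label_for_charts": "first_name is NULL",
            "is_null_level": True,
        },
        {
            "sql_condition": '"first_name_l" = "first_name_r"',
            "label_for_charts": "Exact match on first_name",
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
}

print(levels_to_case_statement(first_name_dict))
```

Viewed this way, each record pair falls into exactly one level: the first `WHEN` branch that is true, or `ELSE` if none is.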

We can now generate a second, more complex comparison:

```py
dob_comparison = cl.LevenshteinAtThresholds("dob", [1, 2])
print(dob_comparison.get_comparison("duckdb").human_readable_description)
```

```
Comparison 'Exact match 'dob' vs. Levenshtein distance at thresholds 1, 2 vs. anything else' of "dob".
Similarity is assessed using the following ComparisonLevels:
    - 'dob is NULL' with SQL rule: "dob_l" IS NULL OR "dob_r" IS NULL
    - 'Exact match on dob' with SQL rule: "dob_l" = "dob_r"
    - 'Levenshtein distance of dob <= 1' with SQL rule: levenshtein("dob_l", "dob_r") <= 1
    - 'Levenshtein distance of dob <= 2' with SQL rule: levenshtein("dob_l", "dob_r") <= 2
    - 'All other comparisons' with SQL rule: ELSE
```
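
To make the thresholds concrete, the sketch below reimplements the level logic described above in plain Python. It assumes dates of birth are stored as `YYYY-MM-DD` strings; the `levenshtein` and `dob_level` functions are hypothetical helpers written for this illustration, whereas in Splink the distance is computed by the SQL backend.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def dob_level(dob_l, dob_r, thresholds=(1, 2)):
    """Return the label of the first level that the pair of values satisfies."""
    if dob_l is None or dob_r is None:
        return "dob is NULL"
    if dob_l == dob_r:
        return "Exact match on dob"
    distance = levenshtein(dob_l, dob_r)
    for threshold in thresholds:
        if distance <= threshold:
            return f"Levenshtein distance of dob <= {threshold}"
    return "All other comparisons"


print(dob_level("1990-03-12", "1990-03-13"))  # Levenshtein distance of dob <= 1
print(dob_level("1990-03-12", "1990-08-17"))  # Levenshtein distance of dob <= 2
print(dob_level("1990-03-12", "1985-11-30"))  # All other comparisons
```

A single mistyped digit lands in the `<= 1` level, while two unrelated dates fall through to the final level.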



These `Comparisons` can be specified in a data linking model as follows:

```py
from splink import SettingsCreator, block_on

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    # Comparisons are passed via the `comparisons` argument
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.LevenshteinAtThresholds("dob", [1, 2]),
    ],
)
```

You can find a listing of all available `Comparison`s in the `ComparisonLibrary` [here](../../api_docs/comparison_library.html).

The `ComparisonTemplateLibrary` is very similar to the `ComparisonLibrary`, except that it contains out-of-the-box functions for specific column types, such as names and email addresses.

```py
from splink import comparison_template_library as ctl

print(
    ctl.EmailComparison("email_address")
    .get_comparison("duckdb")
    .human_readable_description
)
```

```
Comparison 'jaro_winkler on username at threshold 0.88 vs. anything else' of "email_address".
Similarity is assessed using the following ComparisonLevels:
    - 'email_address is NULL' with SQL rule: "email_address_l" IS NULL OR "email_address_r" IS NULL
    - 'Exact match on email_address' with SQL rule: "email_address_l" = "email_address_r"
    - 'Exact match on transformed email_address' with SQL rule: NULLIF(regexp_extract("email_address_l", '^[^@]+', 0), '') = NULLIF(regexp_extract("email_address_r", '^[^@]+', 0), '')
    - 'Jaro-Winkler distance of email_address >= 0.88' with SQL rule: jaro_winkler_similarity("email_address_l", "email_address_r") >= 0.88
    - 'Jaro-Winkler distance of transformed email_address >= 0.88' with SQL rule: jaro_winkler_similarity(NULLIF(regexp_extract("email_address_l", '^[^@]+', 0), ''), NULLIF(regexp_extract("email_address_r", '^[^@]+', 0), '')) >= 0.88
    - 'All other comparisons' with SQL rule: ELSE
```
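
The "transformed email_address" in the levels above is the part of the address before the `@`, extracted with the regular expression `^[^@]+`. As a rough illustration, the same transformation can be mimicked in plain Python; the `username` and `usernames_match` helpers are hypothetical names for this sketch and are not part of Splink.

```python
import re


def username(email):
    """Mimic NULLIF(regexp_extract(email, '^[^@]+', 0), '') from the SQL rule.

    Returns the text before the first '@', or None when nothing can be
    extracted (the SQL version yields NULL in that case).
    """
    if email is None:
        return None
    match = re.match(r"^[^@]+", email)
    return match.group(0) if match else None


def usernames_match(email_l, email_r):
    """Exact match on the transformed value; a missing value never matches,
    mirroring how a comparison with NULL is not true in SQL."""
    user_l, user_r = username(email_l), username(email_r)
    return user_l is not None and user_l == user_r


print(username("jane.doe@example.com"))  # jane.doe
print(usernames_match("jane.doe@example.com", "jane.doe@example.org"))  # True
print(usernames_match("@example.com", "@example.com"))  # False
```

This is why two addresses with the same username but different domains fall into the "Exact match on transformed email_address" level rather than the final one.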



You can find a listing of all available `ComparisonTemplates` in the `ComparisonTemplateLibrary` [here](../../api_docs/comparison_template_library.html).

For a deep dive on Comparison Templates, see the dedicated [topic guide](./comparison_templates.ipynb).



## Method 2: `ComparisonLevels`

The `ComparisonLevels` API is a lower-level API that allows you to compose your own comparisons.

For example, you may wish to specify a comparison that has levels for a match on the dmetaphone and on the Jaro-Winkler similarity of the `first_name` field.

The example below assumes you have derived a column `dmeta_first_name` which contains the dmetaphone of the first name.

```py
from splink.comparison_library import CustomComparison
import splink.comparison_level_library as cll

custom_name_comparison = CustomComparison(
    output_column_name="first_name",
    comparison_description="First name jaro dmeta",
    comparison_levels=[
        cll.NullLevel("first_name"),
        cll.ExactMatchLevel("first_name").configure(tf_adjustment_column="first_name"),
        cll.ExactMatchLevel("dmeta_first_name").configure(
            tf_adjustment_column="dmeta_first_name"
        ),
        cll.ElseLevel(),
    ],
)

print(custom_name_comparison.get_comparison("duckdb").human_readable_description)
```

```
Comparison 'First name jaro dmeta' of "first_name" and "dmeta_first_name".
Similarity is assessed using the following ComparisonLevels:
    - 'first_name is NULL' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
    - 'Exact match on first_name' with SQL rule: "first_name_l" = "first_name_r"
    - 'Exact match on dmeta_first_name' with SQL rule: "dmeta_first_name_l" = "dmeta_first_name_r"
    - 'All other comparisons' with SQL rule: ELSE
```



This can now be specified in the settings dictionary as follows:

```py
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    comparisons=[
        custom_name_comparison,
        cl.LevenshteinAtThresholds("dob", [1, 2]),
    ],
)
```

To inspect the custom comparison as a dictionary, you can call `custom_name_comparison.get_comparison("duckdb").as_dict()`.

<hr>

## Method 3: Providing the spec as a dictionary

Behind the scenes in Splink, all `Comparisons` are eventually turned into a dictionary which conforms to [the formal `jsonschema` specification of the settings dictionary](https://github.com/moj-analytical-services/splink/blob/master/splink/files/settings_jsonschema.json), which is also described in [the documentation](https://moj-analytical-services.github.io/splink/).

The library functions described above are convenience functions that provide a shorthand way to produce valid dictionaries.

For maximum control over your settings, you can specify your comparisons as a dictionary.

```py
comparison_first_name = {
    "output_column_name": "first_name",
    "comparison_description": "First name jaro dmeta",
    "comparison_levels": [
        {
            "sql_condition": "first_name_l IS NULL OR first_name_r IS NULL",
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": "first_name_l = first_name_r",
            "label_for_charts": "Exact match",
            "tf_adjustment_column": "first_name",
            "tf_adjustment_weight": 1.0,
            "tf_minimum_u_value": 0.001,
        },
        {
            "sql_condition": "dmeta_first_name_l = dmeta_first_name_r",
            # Each level gets its own label so that charts are unambiguous
            "label_for_charts": "Exact match on dmetaphone",
            "tf_adjustment_column": "dmeta_first_name",
            "tf_adjustment_weight": 1.0,
        },
        {
            # jaro_winkler_similarity is the DuckDB function name, as in the
            # SQL generated for the duckdb backend earlier in this guide
            "sql_condition": "jaro_winkler_similarity(first_name_l, first_name_r) > 0.8",
            "label_for_charts": "Jaro-Winkler > 0.8",
            "tf_adjustment_column": "first_name",
            "tf_adjustment_weight": 0.5,
            "tf_minimum_u_value": 0.001,
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
}

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    comparisons=[
        comparison_first_name,
        cl.LevenshteinAtThresholds("dob", [1, 2]),
    ],
)
```
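
Hand-written dictionaries are easy to get subtly wrong, so a quick sanity check before passing one to your settings can be useful. The sketch below is a hypothetical helper, `check_comparison_dict`, written for this guide; the rules it applies (a final `ELSE` level, an `sql_condition` on every level, distinct chart labels) reflect the conventions used in the examples here and are not Splink's own validation.

```python
def check_comparison_dict(comparison: dict) -> list:
    """Return a list of human-readable problems found in a comparison dict."""
    problems = []

    for key in ("output_column_name", "comparison_levels"):
        if key not in comparison:
            problems.append(f"missing key: {key}")

    levels = comparison.get("comparison_levels", [])
    for i, level in enumerate(levels):
        if "sql_condition" not in level:
            problems.append(f"level {i} has no sql_condition")

    # Convention in this guide: exactly one ELSE level, placed last
    else_positions = [
        i for i, level in enumerate(levels) if level.get("sql_condition") == "ELSE"
    ]
    if else_positions != [len(levels) - 1]:
        problems.append("expected exactly one ELSE level, in last position")

    # Repeated labels make charts hard to read
    labels = [level.get("label_for_charts") for level in levels]
    duplicates = sorted(
        {label for label in labels if label is not None and labels.count(label) > 1}
    )
    if duplicates:
        problems.append(f"duplicate label_for_charts: {duplicates}")

    return problems


example = {
    "output_column_name": "first_name",
    "comparison_levels": [
        {"sql_condition": "first_name_l = first_name_r", "label_for_charts": "Match"},
        {"sql_condition": "dmeta_l = dmeta_r", "label_for_charts": "Match"},
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
}

for problem in check_comparison_dict(example):
    print(problem)  # duplicate label_for_charts: ['Match']
```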


<hr>

## Examples

Below are some examples of how you can define the same comparison, but through different methods.


### Exact match Comparison with Term-Frequency Adjustments

???+ example 

    ===+ "Comparison Library"

        ```py
        import splink.comparison_library as cl

        first_name_comparison = cl.ExactMatch("first_name").configure(
            term_frequency_adjustments=True
        )
        ```

    === "Comparison Level Library"

        ```py
        import splink.comparison_library as cl
        import splink.comparison_level_library as cll

        first_name_comparison = cl.CustomComparison(
            output_column_name="first_name",
            comparison_description="Exact match vs. anything else",
            comparison_levels=[
                cll.NullLevel("first_name"),
                cll.ExactMatchLevel("first_name").configure(tf_adjustment_column="first_name"),
                cll.ElseLevel(),
            ],
        )
        ```
        
    === "Settings Dictionary"

        ```py
        first_name_comparison = {
            'output_column_name': 'first_name',
            'comparison_levels': [
                {
                    'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
                    'label_for_charts': 'Null',
                    'is_null_level': True
                },
                {
                    'sql_condition': '"first_name_l" = "first_name_r"',
                    'label_for_charts': 'Exact match',
                    'tf_adjustment_column': 'first_name',
                    'tf_adjustment_weight': 1.0
                },
                {
                    'sql_condition': 'ELSE', 
                    'label_for_charts': 'All other comparisons'
                }],
            'comparison_description': 'Exact match vs. anything else'
        }

        ```

    Each of which gives

    ```py
    {
        'output_column_name': 'first_name',
        'comparison_levels': [
            {
                'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
                'label_for_charts': 'Null',
                'is_null_level': True
            },
            {
                'sql_condition': '"first_name_l" = "first_name_r"',
                'label_for_charts': 'Exact match',
                'tf_adjustment_column': 'first_name',
                'tf_adjustment_weight': 1.0
            },
            {
                'sql_condition': 'ELSE', 
                'label_for_charts': 'All other comparisons'
            }],
        'comparison_description': 'Exact match vs. anything else'
    }
    ```
    in your settings dictionary.

### Levenshtein Comparison

??? example

    ===+ "Comparison Library"

        ```py
        import splink.comparison_library as cl

        email_comparison = cl.LevenshteinAtThresholds("email", [2, 4])
        ```

    === "Comparison Level Library"

        ```py
        import splink.comparison_library as cl
        import splink.comparison_level_library as cll

        email_comparison = cl.CustomComparison(
            output_column_name="email",
            comparison_description="Exact match vs. Email within levenshtein thresholds 2, 4 vs. anything else",
            comparison_levels=[
                cll.NullLevel("email"),
                cll.ExactMatchLevel("email"),
                cll.LevenshteinLevel("email", distance_threshold=2),
                cll.LevenshteinLevel("email", distance_threshold=4),
                cll.ElseLevel(),
            ],
        )
        ```

    === "Settings Dictionary"

        ```py
        email_comparison = {
            'output_column_name': 'email',
            'comparison_levels': [
                {
                    'sql_condition': '"email_l" IS NULL OR "email_r" IS NULL',
                    'label_for_charts': 'Null',
                    'is_null_level': True
                },
                {
                    'sql_condition': '"email_l" = "email_r"',
                    'label_for_charts': 'Exact match'
                },
                {
                    'sql_condition': 'levenshtein("email_l", "email_r") <= 2',
                    'label_for_charts': 'Levenshtein <= 2'
                },
                {
                    'sql_condition': 'levenshtein("email_l", "email_r") <= 4',
                    'label_for_charts': 'Levenshtein <= 4'
                },
                {
                    'sql_condition': 'ELSE',
                    'label_for_charts': 'All other comparisons'
                }],
            'comparison_description': 'Exact match vs. Email within levenshtein thresholds 2, 4 vs. anything else'
        }
        ```

    Each of which gives

    ```py
    {
        'output_column_name': 'email',
        'comparison_levels': [
            {
                'sql_condition': '"email_l" IS NULL OR "email_r" IS NULL',
                'label_for_charts': 'Null',
                'is_null_level': True
            },
            {
                'sql_condition': '"email_l" = "email_r"',
                'label_for_charts': 'Exact match'
            },
            {
                'sql_condition': 'levenshtein("email_l", "email_r") <= 2',
                'label_for_charts': 'Levenshtein <= 2'
            },
            {
                'sql_condition': 'levenshtein("email_l", "email_r") <= 4',
                'label_for_charts': 'Levenshtein <= 4'
            },
            {
                'sql_condition': 'ELSE', 
                'label_for_charts': 'All other comparisons'
            }],
        'comparison_description': 'Exact match vs. Email within levenshtein thresholds 2, 4 vs. anything else'
    }
    ```

    in your settings dictionary.