# Out-of-the-box Comparisons for specific data types

Similarity is defined differently for different types of data (e.g. names, dates of birth, postcodes, addresses, ids). The [Comparison Template Library](customising_comparisons.ipynb#method-2-using-the-comparisontemplatelibrary) contains functions to generate ready-made comparisons for a variety of data types.

Below are examples of how to structure comparisons for a variety of data types.

<hr>

## Date Comparisons

Date comparisons are generally structured as: 

- Null level  
- Exact match  
- Fuzzy match ([using metric of choice](comparators.md))  
- Interval match (within X days/months/years)  
- Else level
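
To make the ordering of these levels concrete, the sketch below assigns a pair of date strings to the first level whose rule holds. It is a plain-Python illustration rather than Splink's implementation: the function names `levenshtein` and `date_comparison_level`, the ISO `YYYY-MM-DD` string format, and the thresholds are assumptions made for this example, and the interval is approximated in days rather than as a calendar-year difference.

```python
from datetime import date


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (classic dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def date_comparison_level(left, right, max_edits=1, max_days=365):
    """Return the label of the first level that matches, checked in order."""
    if left is None or right is None:
        return "Null"
    if left == right:
        return "Exact match"
    if levenshtein(left, right) <= max_edits:
        return f"Levenshtein <= {max_edits}"
    gap = abs((date.fromisoformat(left) - date.fromisoformat(right)).days)
    if gap <= max_days:
        return f"Within {max_days} days"
    return "All other comparisons"
```

Because the levels are checked in order, `"1990-01-01"` and `"1980-01-01"` land in the fuzzy level (they are one edit apart) even though they are ten years apart, so the order and thresholds of the levels deserve some thought.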

The [comparison_template_library](customising_comparisons.ipynb#method-2-using-the-comparisontemplatelibrary) contains the [date_comparison](../comparison_template_library.md#splink.comparison_template_library.DateComparisonBase) function, which gives this structure, with some pre-defined parameters, out-of-the-box.

```python
from splink.duckdb.duckdb_comparison_template_library import date_comparison

date_of_birth_comparison = date_comparison("date_of_birth")
```

This gives a comparison structured as follows:

```
Comparison: Date of birth
├─-- ComparisonLevel: Exact match
├─-- ComparisonLevel: Up to one character difference
├─-- ComparisonLevel: Up to two character difference
├─-- ComparisonLevel: Dates within 1 year of each other
├─-- ComparisonLevel: Dates within 10 years of each other
├─-- ComparisonLevel: All other
```

Alternatively, `human_readable_description` generates a description automatically from `date_of_birth_comparison`:

```python
print(date_of_birth_comparison.human_readable_description)
```

```
Comparison 'Exact match vs. Dates within levenshtein thresholds 1, 2 vs. Dates within the following thresholds Year(s): 1, Year(s): 10 vs. anything else' of "date_of_birth".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "date_of_birth_l" IS NULL OR "date_of_birth_r" IS NULL
    - 'Exact match' with SQL rule: "date_of_birth_l" = "date_of_birth_r"
    - 'Levenshtein <= 1' with SQL rule: levenshtein("date_of_birth_l", "date_of_birth_r") <= 1
    - 'Levenshtein <= 2' with SQL rule: levenshtein("date_of_birth_l", "date_of_birth_r") <= 2
    - 'Within 1 year' with SQL rule: 
        abs(date_diff('year', "date_of_birth_l", "date_of_birth_r")) <= 1
    
    - 'Within 10 years' with SQL rule: 
        abs(date_diff('year', "date_of_birth_l", "date_of_birth_r")) <= 10
    
    - 'All other comparisons' with SQL rule: ELSE
```



The [date_comparison](../comparison_template_library.md#splink.comparison_template_library.DateComparisonBase) function also lets the user change the parameters and/or the fuzzy matching comparison levels.

For example:

```python
date_of_birth_comparison = date_comparison(
    "date_of_birth",
    levenshtein_thresholds=[],
    jaro_winkler_thresholds=[0.88],
    datediff_thresholds=[1, 1],
    datediff_metrics=["month", "year"],
)
print(date_of_birth_comparison.human_readable_description)
```

```
Comparison 'Exact match vs. Dates within jaro_winkler threshold 0.88 vs. Dates within the following thresholds Month(s): 1, Year(s): 1 vs. anything else' of "date_of_birth".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "date_of_birth_l" IS NULL OR "date_of_birth_r" IS NULL
    - 'Exact match' with SQL rule: "date_of_birth_l" = "date_of_birth_r"
    - 'Jaro_winkler_similarity >= 0.88' with SQL rule: jaro_winkler_similarity("date_of_birth_l", "date_of_birth_r") >= 0.88
    - 'Within 1 month' with SQL rule: 
        abs(date_diff('month', "date_of_birth_l", "date_of_birth_r")) <= 1
    
    - 'Within 1 year' with SQL rule: 
        abs(date_diff('year', "date_of_birth_l", "date_of_birth_r")) <= 1
    
    - 'All other comparisons' with SQL rule: ELSE
```



To see this as a specification dictionary, call `as_dict()`:

```python
date_of_birth_comparison.as_dict()
```

```
{'output_column_name': 'date_of_birth',
 'comparison_levels': [{'sql_condition': '"date_of_birth_l" IS NULL OR "date_of_birth_r" IS NULL',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': '"date_of_birth_l" = "date_of_birth_r"',
   'label_for_charts': 'Exact match'},
  {'sql_condition': 'jaro_winkler_similarity("date_of_birth_l", "date_of_birth_r") >= 0.88',
   'label_for_charts': 'Jaro_winkler_similarity >= 0.88'},
  {'sql_condition': '\n        abs(date_diff(\'month\', "date_of_birth_l", "date_of_birth_r")) <= 1\n    ',
   'label_for_charts': 'Within 1 month'},
  {'sql_condition': '\n        abs(date_diff(\'year\', "date_of_birth_l", "date_of_birth_r")) <= 1\n    ',
   'label_for_charts': 'Within 1 year'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'Exact match vs. Dates within jaro_winkler threshold 0.88 vs. Dates within the following thresholds Month(s): 1, Year(s): 1 vs. anything else'}
```

This dictionary can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired.
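
As a rough illustration of that workflow, the sketch below takes a trimmed-down copy of the date of birth dictionary and inserts an extra level ahead of the `ELSE` level. The added "Within 5 years" level and the helper name `insert_level_before_else` are hypothetical, chosen only for this example; the linked topic guide describes how such a dictionary is then supplied to Splink.

```python
import copy

# Trimmed-down copy of the specification dictionary shown above.
date_of_birth_spec = {
    "output_column_name": "date_of_birth",
    "comparison_levels": [
        {
            "sql_condition": '"date_of_birth_l" IS NULL OR "date_of_birth_r" IS NULL',
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": '"date_of_birth_l" = "date_of_birth_r"',
            "label_for_charts": "Exact match",
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
}


def insert_level_before_else(spec, new_level):
    """Return a copy of spec with new_level placed just before the ELSE level."""
    updated = copy.deepcopy(spec)  # leave the original dictionary untouched
    levels = updated["comparison_levels"]
    else_index = next(
        i for i, level in enumerate(levels) if level["sql_condition"] == "ELSE"
    )
    levels.insert(else_index, new_level)
    return updated


# Hypothetical extra level, written in the same SQL style as the generated ones.
within_five_years = {
    "sql_condition": (
        "abs(date_diff('year', \"date_of_birth_l\", \"date_of_birth_r\")) <= 5"
    ),
    "label_for_charts": "Within 5 years",
}

custom_spec = insert_level_before_else(date_of_birth_spec, within_five_years)
print([level["label_for_charts"] for level in custom_spec["comparison_levels"]])
```

Keeping the `ELSE` level last matters because levels are evaluated in order, so anything placed after it would never be reached.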

<hr>

## Name Comparisons

Name comparisons for an individual name column (e.g. forename, surname) are generally structured as: 

- Null level  
- Exact match  
- Fuzzy match ([using metric of choice](comparators.md))  
- Else level
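
The sketch below walks a pair of names through the same sequence of levels, using a plain-Python Jaro-Winkler similarity as the fuzzy metric. It is an illustration under assumptions rather than Splink's code: the function names are made up for the example, the prefix weight of 0.1 and four-character prefix cap are the conventional Jaro-Winkler settings, and the thresholds 0.95 and 0.88 are example values.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


def name_comparison_level(left, right, thresholds=(0.95, 0.88)):
    """Return the label of the first level that matches, checked in order."""
    if left is None or right is None:
        return "Null"
    if left == right:
        return "Exact match"
    score = jaro_winkler(left, right)
    for threshold in thresholds:  # strictest threshold first
        if score >= threshold:
            return f"Jaro-Winkler >= {threshold}"
    return "All other comparisons"
```

With these example thresholds, `name_comparison_level("stephen", "steven")` falls into the 0.88 level rather than the 0.95 level, since the two spellings share a prefix but differ by two characters.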

The [comparison_template_library](customising_comparisons.ipynb#method-2-using-the-comparisontemplatelibrary) contains the [name_comparison](../comparison_template_library.md#splink.comparison_template_library.NameComparisonBase) function, which gives this structure, with some pre-defined parameters, out-of-the-box.

```python
from splink.duckdb.duckdb_comparison_template_library import name_comparison

first_name_comparison = name_comparison("first_name")
```

This gives a comparison structured as follows:

```
Comparison: First Name
├─-- ComparisonLevel: Exact match
├─-- ComparisonLevel: First Names with Jaro-Winkler similarity greater than 0.95
├─-- ComparisonLevel: First Names with Jaro-Winkler similarity greater than 0.88
├─-- ComparisonLevel: All other
```

Alternatively, `human_readable_description` generates a description automatically from `first_name_comparison`:

```python
print(first_name_comparison.human_readable_description)
```

```
Comparison 'Exact match vs. Names within jaro_winkler thresholds 0.95, 0.88 vs. anything else' of "first_name".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
    - 'Exact match first_name' with SQL rule: "first_name_l" = "first_name_r"
    - 'Jaro_winkler_similarity >= 0.95' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.95
    - 'Jaro_winkler_similarity >= 0.88' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.88
    - 'All other comparisons' with SQL rule: ELSE
```



The [name_comparison](../comparison_template_library.md#splink.comparison_template_library.NameComparisonBase) function also lets the user change the parameters and/or the fuzzy matching comparison levels.

For example:

```python
surname_comparison = name_comparison(
    "surname",
    phonetic_col_name="surname_dm",
    term_frequency_adjustments_name=True,
    levenshtein_thresholds=[2],
    jaro_winkler_thresholds=[],
    jaccard_thresholds=[1],
)
print(surname_comparison.human_readable_description)
```

```
Comparison 'Exact match vs. Names with phonetic exact match vs. Dates within levenshtein threshold 2 vs. Names within jaccard threshold 1 vs. anything else' of "surname" and "surname_dm".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "surname_l" IS NULL OR "surname_r" IS NULL
    - 'Exact match surname' with SQL rule: "surname_l" = "surname_r"
    - 'Exact match surname_dm' with SQL rule: "surname_dm_l" = "surname_dm_r"
    - 'Levenshtein <= 2' with SQL rule: levenshtein("surname_l", "surname_r") <= 2
    - 'Jaccard >= 1' with SQL rule: jaccard("surname_l", "surname_r") >= 1
    - 'All other comparisons' with SQL rule: ELSE
```



Here, `surname_dm` refers to a column that holds the result of applying the DoubleMetaphone algorithm to `surname`, giving a phonetic spelling. This helps to catch names which sound the same but have different spellings (e.g. Stephens vs Stevens). For more on Phonetic Transformations, see the [topic guide](phonetic.md).

To see this as a specification dictionary, call `as_dict()`:

```python
surname_comparison.as_dict()
```

```
{'output_column_name': 'custom_surname_surname_dm',
 'comparison_levels': [{'sql_condition': '"surname_l" IS NULL OR "surname_r" IS NULL',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': '"surname_l" = "surname_r"',
   'label_for_charts': 'Exact match surname',
   'tf_adjustment_column': 'surname',
   'tf_adjustment_weight': 1.0},
  {'sql_condition': '"surname_dm_l" = "surname_dm_r"',
   'label_for_charts': 'Exact match surname_dm'},
  {'sql_condition': 'levenshtein("surname_l", "surname_r") <= 2',
   'label_for_charts': 'Levenshtein <= 2'},
  {'sql_condition': 'jaccard("surname_l", "surname_r") >= 1',
   'label_for_charts': 'Jaccard >= 1'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'Exact match vs. Names with phonetic exact match vs. Dates within levenshtein threshold 2 vs. Names within jaccard threshold 1 vs. anything else'}
```

This dictionary can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired.

<hr>

## Forename and Surname Comparisons

It can be helpful to construct a single comparison for the forename and surname of two records because:

1. The Fellegi-Sunter model **assumes that comparisons are independent**. We know that forename and surname are usually correlated (for example, because of regional variation in names), so considering them in a single comparison can help to create better models.

2. **Term-frequencies** of forename and surname individually do not necessarily reflect how common the combination of forename and surname is. 
For example, in the UK population “Mohammed Khan” is a relatively common full name despite neither name occurring frequently. For more information on term-frequencies, see the dedicated [topic guide](term-frequency.md).
Addressing forename and surname in a single comparison allows the model to consider the joint term-frequency as well as the individual term-frequencies.

3. It is common for some records to have **swapped forename and surname by mistake**. Addressing forename and surname in a single comparison allows the model to consider these name inversions.
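
The second point can be illustrated with a small count over made-up records. The names and counts below are invented purely for illustration and do not reflect real population data; the sketch only shows that the frequency of a full name cannot be read off from the frequencies of its parts.

```python
from collections import Counter

# Invented records as (forename, surname) pairs. Not real population data.
records = [
    ("Mohammed", "Khan"),
    ("Mohammed", "Khan"),
    ("Mohammed", "Khan"),
    ("John", "Smith"),
    ("John", "Jones"),
    ("John", "Taylor"),
    ("Mary", "Smith"),
    ("Anne", "Smith"),
]

forename_counts = Counter(forename for forename, _ in records)
surname_counts = Counter(surname for _, surname in records)
full_name_counts = Counter(f"{forename} {surname}" for forename, surname in records)

# In this toy data "John" and "Smith" are each as common as "Mohammed" and "Khan" ...
print(forename_counts["John"], forename_counts["Mohammed"])
print(surname_counts["Smith"], surname_counts["Khan"])
# ... yet the combination "John Smith" is rarer than "Mohammed Khan".
print(full_name_counts["John Smith"], full_name_counts["Mohammed Khan"])
```

A term-frequency adjustment based only on the individual columns would treat these two full names alike, whereas one based on the combined name can tell them apart.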


Forename and surname comparisons are generally structured as:

- Null level  
- Exact match Forename and Surname
- Exact match Forename and Surname swapped
- Exact match Surname
- Exact match Forename
- Fuzzy match Surname ([using metric of choice](comparators.md))
- Fuzzy match Forename ([using metric of choice](comparators.md))
- Else level
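
As a sketch of how these levels are checked in order, the function below assigns a pair of records to the first matching level. It is a simplified plain-Python illustration rather than Splink's implementation: the name `forename_surname_level` is hypothetical, records are assumed to be dictionaries with `forename` and `surname` keys, and `difflib.SequenceMatcher` stands in for the fuzzy metric (it is not Jaro-Winkler), with 0.88 as an example threshold.

```python
from difflib import SequenceMatcher


def equal(a, b):
    """SQL-style equality: a comparison involving a missing value is not a match."""
    return a is not None and b is not None and a == b


def similar(a, b, threshold=0.88):
    """Stand-in fuzzy metric: difflib's ratio, not Jaro-Winkler."""
    if a is None or b is None:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold


def forename_surname_level(left, right):
    """Return the label of the first level that matches, checked in order."""
    f_l, s_l = left.get("forename"), left.get("surname")
    f_r, s_r = right.get("forename"), right.get("surname")

    forename_null = f_l is None or f_r is None
    surname_null = s_l is None or s_r is None
    if forename_null and surname_null:
        return "Null"
    if equal(f_l, f_r) and equal(s_l, s_r):
        return "Full name exact match"
    if equal(f_l, s_r) and equal(f_r, s_l):
        return "Exact match on reversed cols"
    if equal(s_l, s_r):
        return "Exact match surname"
    if equal(f_l, f_r):
        return "Exact match forename"
    if similar(s_l, s_r):
        return "Fuzzy match surname"
    if similar(f_l, f_r):
        return "Fuzzy match forename"
    return "All other comparisons"
```

For example, comparing `{"forename": "Lee", "surname": "Anna"}` with `{"forename": "Anna", "surname": "Lee"}` reaches the reversed-columns level, which two separate per-column comparisons could not express.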

The [comparison_template_library](customising_comparisons.ipynb#method-2-using-the-comparisontemplatelibrary) contains the [forename_surname_comparison](../comparison_template_library.md#splink.comparison_template_library.ForenameSurnameComparisonBase) function, which gives this structure, with some pre-defined parameters, out-of-the-box.

```python
from splink.duckdb.duckdb_comparison_template_library import forename_surname_comparison

name_comparison = forename_surname_comparison("forename", "surname")
```

This gives a comparison structured as follows:

```
Comparison: Forename and Surname
├─-- ComparisonLevel: Exact match Forename and Surname
├─-- ComparisonLevel: Exact match Forename and Surname swapped
├─-- ComparisonLevel: Exact match Surname
├─-- ComparisonLevel: Exact match Forename
├─-- ComparisonLevel: Surnames with Jaro-Winkler similarity greater than 0.88
├─-- ComparisonLevel: Forenames with Jaro-Winkler similarity greater than 0.88
├─-- ComparisonLevel: All other
```

Alternatively, `human_readable_description` generates a description automatically from `name_comparison`:

```python
print(name_comparison.human_readable_description)
```

```
Comparison 'Exact match vs. Forename and surname columns reversed vs. Surname exact match vs. Forename exact match vs. Surname within jaro-winkler threshold 0.88 vs. Forename within jaro-winkler threshold 0.88 vs. anything else' of "forename" and "surname".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: ("forename_l" IS NULL OR "forename_r" IS NULL) AND ("surname_l" IS NULL OR "surname_r" IS NULL)
    - 'Full name exact match' with SQL rule: "forename_l" = "forename_r" AND "surname_l" = "surname_r"
    - 'Exact match on reversed cols' with SQL rule: "forename_l" = "surname_r" and "forename_r" = "surname_l"
    - 'Exact match surname' with SQL rule: "surname_l" = "surname_r"
    - 'Exact match forename' with SQL rule: "forename_l" = "forename_r"
    - 'Jaro_winkler_similarity surname >= 0.88' with SQL rule: jaro_winkler_similarity("surname_l", "surname_r") >= 0.88
    - 'Jaro_winkler_similarity forename >= 0.88' with SQL rule: jaro_winkler_simil
```

The [forename_surname_comparison](../comparison_template_library.md#splink.comparison_template_library.ForenameSurnameComparisonBase) function also lets the user change the parameters and/or the fuzzy matching comparison levels.

For example:

```python
full_name_comparison = forename_surname_comparison(
    "forename",
    "surname",
    term_frequency_adjustments=True,
    tf_adjustment_col_forename_and_surname="full_name",
    phonetic_forename_col_name="forename_dm",
    phonetic_surname_col_name="surname_dm",
    levenshtein_thresholds=[2],
    jaro_winkler_thresholds=[],
    jaccard_thresholds=[1],
)
print(full_name_comparison.human_readable_description)
```

```
Comparison 'Exact match vs. Phonetic match forename and surname vs. Forename and surname columns reversed vs. Surname exact match vs. Forename exact match vs. Surname within levenshtein threshold 2 vs. Surname within jaccard threshold 1 vs. Forename within levenshtein threshold 2 vs. Forename within jaccard threshold 1 vs. anything else' of "forename", "surname", "forename_dm" and "surname_dm".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: ("forename_l" IS NULL OR "forename_r" IS NULL) AND ("surname_l" IS NULL OR "surname_r" IS NULL)
    - 'Full name exact match' with SQL rule: "forename_l" = "forename_r" AND "surname_l" = "surname_r"
    - 'Full name phonetic match' with SQL rule: "forename_dm_l"="forename_dm_r" AND "surname_dm_l" = "surname_dm_r"
    - 'Exact match on reversed cols' with SQL rule: "forename_l" = "surname_r" and "forename_r" = "surname_l"
    - 'Exact match surname' with SQL rule: "surname_l" = "surname_r"
    - 'Exact match 
```

Where:

- `forename_dm` and `surname_dm` refer to columns that hold the result of applying the DoubleMetaphone algorithm to `forename` and `surname`, giving a phonetic spelling. This helps to catch names which sound the same but have different spellings (e.g. Stephens vs Stevens). For more on Phonetic Transformations, see the [topic guide](phonetic.md). These columns will have to already exist in the dataset, or be created in the [feature engineering](feature_engineering.md#phonetic-transformations) stage when preparing datasets for linking.

- `full_name` is a column containing `forename` and `surname` so that the model can consider the term-frequency of the full name, as well as `forename` and `surname` individually. This column will have to already exist in the dataset, or be created in the [feature engineering](feature_engineering.md#full-name) stage when preparing datasets for linking.

To see this as a specification dictionary, call `as_dict()`:

```python
full_name_comparison.as_dict()
```

```
{'output_column_name': 'custom_forename_surname_forename_dm_surname_dm',
 'comparison_levels': [{'sql_condition': '("forename_l" IS NULL OR "forename_r" IS NULL) AND ("surname_l" IS NULL OR "surname_r" IS NULL)',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': '"forename_l" = "forename_r" AND "surname_l" = "surname_r"',
   'label_for_charts': 'Full name exact match',
   'tf_adjustment_column': 'full_name',
   'tf_adjustment_weight': 1.0},
  {'sql_condition': '"forename_dm_l"="forename_dm_r" AND "surname_dm_l" = "surname_dm_r"',
   'label_for_charts': 'Full name phonetic match',
   'tf_adjustment_column': 'full_name',
   'tf_adjustment_weight': 1.0},
  {'sql_condition': '"forename_l" = "surname_r" and "forename_r" = "surname_l"',
   'label_for_charts': 'Exact match on reversed cols',
   'tf_adjustment_column': 'full_name',
   'tf_adjustment_weight': 1.0},
  {'sql_condition': '"surname_l" = "surname_r"',
   'label_for_charts': 'Exact match surname',
   'tf_a
```

This dictionary can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired.