# Out-of-the-box Comparisons for specific data types

Similarity is defined differently for different types of data (e.g. names, dates of birth, postcodes, addresses, IDs). The [Comparison Template Library](customising_comparisons.ipynb#method-2-using-the-comparisontemplatelibrary) contains functions to generate ready-made comparisons for a variety of data types.

Below are examples of how to structure comparisons for a variety of data types.

<hr>

## Date Comparisons

Date comparisons are generally structured as: 

- Null level  
- Exact match  
- Fuzzy match ([using metric of choice](comparators.md))  
- Interval match (within X days/months/years)  
- Else level
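
To make the ordering of these levels concrete, the sketch below mimics the structure in plain Python rather than through Splink. It is illustrative only: `levenshtein` and `date_comparison_level` are hypothetical helpers, the thresholds are arbitrary example values, and "within 1 year" is approximated as 365 days, which is an assumption and not how a SQL backend computes date differences. Levels are checked from top to bottom and the first one that matches wins.

```python
from datetime import date


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def date_comparison_level(left, right):
    """Return the label of the first level that matches (hypothetical helper)."""
    if left is None or right is None:
        return "Null"
    if left == right:
        return "Exact match"
    # Fuzzy match on the ISO string form, which catches single-digit typos.
    if levenshtein(left.isoformat(), right.isoformat()) <= 1:
        return "Levenshtein <= 1"
    # Interval match. Assumption: one year is approximated as 365 days.
    if abs((left - right).days) <= 365:
        return "Within 1 year"
    return "All other comparisons"


print(date_comparison_level(date(1990, 1, 1), date(1990, 1, 7)))
print(date_comparison_level(date(1990, 1, 1), date(1990, 6, 15)))
```

Because the fuzzy level sits above the interval level, a one-character typo is assigned to the fuzzy level even when the two dates are also within a year of each other.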

The [comparison_template_library](../comparison_template_library.md#splink.comparison_template_library) contains the [date_comparison](../comparison_template_library.md##splink.comparison_template_library.DateComparisonBase) function which gives this structure, with some pre-defined parameters, out-of-the-box.

```python
from splink.duckdb.duckdb_comparison_template_library import date_comparison

date_of_birth_comparison = date_comparison("date_of_birth")
```

Gives a comparison structured as follows:

```
Comparison: Date of birth
├─-- ComparisonLevel: Exact match
├─-- ComparisonLevel: Up to one character difference
├─-- ComparisonLevel: Up to two character difference
├─-- ComparisonLevel: Dates within 1 year of each other
├─-- ComparisonLevel: Dates within 10 years of each other
├─-- ComparisonLevel: All other
```

Or, using `human_readable_description` to generate automatically from `date_of_birth_comparison`:

```python
print(date_of_birth_comparison.human_readable_description)
```

```
Comparison 'Exact match vs. Date_Of_Birth within levenshtein thresholds 1, 2 vs. Dates within the following thresholds Year(s): 1, Year(s): 10 vs. anything else' of "date_of_birth".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "date_of_birth_l" IS NULL OR "date_of_birth_r" IS NULL
    - 'Exact match' with SQL rule: "date_of_birth_l" = "date_of_birth_r"
    - 'Levenshtein <= 1' with SQL rule: levenshtein("date_of_birth_l", "date_of_birth_r") <= 1
    - 'Levenshtein <= 2' with SQL rule: levenshtein("date_of_birth_l", "date_of_birth_r") <= 2
    - 'Within 1 year' with SQL rule: 
            abs(date_diff('year', "date_of_birth_l",
              "date_of_birth_r")) <= 1

    - 'Within 10 years' with SQL rule: 
            abs(date_diff('year', "date_of_birth_l",
              "date_of_birth_r")) <= 10

    - 'All other comparisons' with SQL rule: ELSE
```



The [date_comparison](../comparison_template_library.md##splink.comparison_template_library.DateComparisonBase) function also allows the user flexibility to change the parameters and/or fuzzy matching comparison levels.

For example:

```python
# Drop the default Levenshtein levels, use a single Jaro-Winkler level,
# and set the interval levels to "within 1 month" and "within 1 year".
date_of_birth_comparison = date_comparison(
    "date_of_birth",
    levenshtein_thresholds=[],
    jaro_winkler_thresholds=[0.88],
    datediff_thresholds=[1, 1],
    datediff_metrics=["month", "year"],
)
print(date_of_birth_comparison.human_readable_description)
```

```
Comparison 'Exact match vs. Date_Of_Birth within jaro_winkler threshold 0.88 vs. Dates within the following thresholds Month(s): 1, Year(s): 1 vs. anything else' of "date_of_birth".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "date_of_birth_l" IS NULL OR "date_of_birth_r" IS NULL
    - 'Exact match' with SQL rule: "date_of_birth_l" = "date_of_birth_r"
    - 'Jaro_winkler_similarity >= 0.88' with SQL rule: jaro_winkler_similarity("date_of_birth_l", "date_of_birth_r") >= 0.88
    - 'Within 1 month' with SQL rule: 
            abs(date_diff('month', "date_of_birth_l",
              "date_of_birth_r")) <= 1

    - 'Within 1 year' with SQL rule: 
            abs(date_diff('year', "date_of_birth_l",
              "date_of_birth_r")) <= 1

    - 'All other comparisons' with SQL rule: ELSE
```



To see this as a specifications dictionary, you can call:

```python
date_of_birth_comparison.as_dict()
```

```
{'output_column_name': 'date_of_birth',
 'comparison_levels': [{'sql_condition': '"date_of_birth_l" IS NULL OR "date_of_birth_r" IS NULL',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': '"date_of_birth_l" = "date_of_birth_r"',
   'label_for_charts': 'Exact match'},
  {'sql_condition': 'jaro_winkler_similarity("date_of_birth_l", "date_of_birth_r") >= 0.88',
   'label_for_charts': 'Jaro_winkler_similarity >= 0.88'},
  {'sql_condition': '\n            abs(date_diff(\'month\', "date_of_birth_l",\n              "date_of_birth_r")) <= 1\n        ',
   'label_for_charts': 'Within 1 month'},
  {'sql_condition': '\n            abs(date_diff(\'year\', "date_of_birth_l",\n              "date_of_birth_r")) <= 1\n        ',
   'label_for_charts': 'Within 1 year'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'Exact match vs. Date_Of_Birth within jaro_winkler threshold 0.88 vs. Dates within the following threshol
```

which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired.

<hr>

## Name Comparisons

Name comparisons for an individual name column (e.g. forename, surname) are generally structured as: 

- Null level  
- Exact match  
- Fuzzy match ([using metric of choice](comparators.md))  
- Else level
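
As an illustration of the fuzzy-match step, here is a minimal pure-Python Jaro-Winkler similarity together with a level-assignment function. This is a sketch under stated assumptions, not Splink's implementation: `jaro`, `jaro_winkler` and `name_comparison_level` are hypothetical names, the thresholds 0.95 and 0.88 are example values, and a SQL backend's own `jaro_winkler_similarity` function may behave differently in edge cases.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    similarity = jaro(s1, s2)
    prefix_length = 0
    for char1, char2 in zip(s1[:4], s2[:4]):
        if char1 != char2:
            break
        prefix_length += 1
    return similarity + prefix_length * prefix_weight * (1 - similarity)


def name_comparison_level(left, right, thresholds=(0.95, 0.88)):
    """Return the label of the first level that matches (hypothetical helper)."""
    if left is None or right is None:
        return "Null"
    if left == right:
        return "Exact match"
    similarity = jaro_winkler(left, right)
    # Check the strictest threshold first so it takes priority.
    for threshold in sorted(thresholds, reverse=True):
        if similarity >= threshold:
            return f"Jaro_winkler_similarity >= {threshold}"
    return "All other comparisons"


print(name_comparison_level("Stephen", "Steven"))
```

Thresholds are checked in descending order, so a pair only falls into the looser fuzzy level when it fails the stricter one.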

The [comparison_template_library](../comparison_template_library.md##splink.comparison_template_library) contains the [name_comparison](../comparison_template_library.md#splink.comparison_template_library.NameComparisonBase) function which gives this structure, with some pre-defined parameters, out-of-the-box.

```python
from splink.duckdb.duckdb_comparison_template_library import name_comparison

first_name_comparison = name_comparison("first_name")
```

Gives a comparison structured as follows:

```
Comparison: First name
├─-- ComparisonLevel: Exact match
├─-- ComparisonLevel: First Names with Jaro-Winkler similarity greater than 0.95
├─-- ComparisonLevel: First Names with Jaro-Winkler similarity greater than 0.88
├─-- ComparisonLevel: All other
```

Or, using `human_readable_description` to generate automatically from `first_name_comparison`:

```python
print(first_name_comparison.human_readable_description)
```

```
Comparison 'Exact match vs. First_Name within jaro_winkler thresholds 0.95, 0.88 vs. anything else' of "first_name".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
    - 'Exact match first_name' with SQL rule: "first_name_l" = "first_name_r"
    - 'Jaro_winkler_similarity >= 0.95' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.95
    - 'Jaro_winkler_similarity >= 0.88' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.88
    - 'All other comparisons' with SQL rule: ELSE
```



The [name_comparison](../comparison_template_library.md#splink.comparison_template_library.NameComparisonBase) function also allows the user flexibility to change the parameters and/or fuzzy matching comparison levels.

For example:

```python
# Add a phonetic exact-match level and term frequency adjustments, and
# replace the default Jaro-Winkler levels with Levenshtein and Jaccard levels.
surname_comparison = name_comparison(
    "surname",
    phonetic_col_name="surname_dm",
    term_frequency_adjustments_name=True,
    levenshtein_thresholds=[2],
    jaro_winkler_thresholds=[],
    jaccard_thresholds=[1],
)
print(surname_comparison.human_readable_description)
```

```
Comparison 'Exact match vs. Names with phonetic exact match vs. Surname within levenshtein threshold 2 vs. Surname within jaccard threshold 1 vs. anything else' of "surname" and "surname_dm".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "surname_l" IS NULL OR "surname_r" IS NULL
    - 'Exact match surname' with SQL rule: "surname_l" = "surname_r"
    - 'Exact match surname_dm' with SQL rule: "surname_dm_l" = "surname_dm_r"
    - 'Levenshtein <= 2' with SQL rule: levenshtein("surname_l", "surname_r") <= 2
    - 'Jaccard >= 1' with SQL rule: jaccard("surname_l", "surname_r") >= 1
    - 'All other comparisons' with SQL rule: ELSE
```



Here, `surname_dm` refers to a column that has been derived by applying the DoubleMetaphone algorithm to `surname` to give a phonetic spelling. This helps to catch names which sound the same but have different spellings (e.g. Stephens vs Stevens). For more on Phonetic Transformations, see the [topic guide](phonetic.md).
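
To show why a phonetic column helps, the sketch below uses Soundex, a much simpler phonetic algorithm swapped in here purely for illustration. It is not DoubleMetaphone and will produce different codes. The `soundex` function is a hypothetical helper written for this example; in practice the phonetic column would be derived before the data is passed to the linker.

```python
# Soundex digit for each consonant group; vowels, H, W and Y have no digit.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return the four-character Soundex code for a name ('' if no letters)."""
    letters = [char for char in name.upper() if char.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous_code = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(letter, "")
        if code and code != previous_code:
            encoded += code
        previous_code = code  # a vowel resets this, so repeats can be re-coded
    # Pad with zeros and truncate to one letter plus three digits.
    return (encoded + "000")[:4]


print(soundex("Stephens"), soundex("Stevens"))
```

Both spellings map to the same code, so an exact match on the phonetic column would catch this pair even though an exact match on `surname` would not.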

To see `surname_comparison` as a specifications dictionary, you can call:

```python
surname_comparison.as_dict()
```

```
{'output_column_name': 'custom_surname_surname_dm',
 'comparison_levels': [{'sql_condition': '"surname_l" IS NULL OR "surname_r" IS NULL',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': '"surname_l" = "surname_r"',
   'label_for_charts': 'Exact match surname',
   'tf_adjustment_column': 'surname',
   'tf_adjustment_weight': 1.0},
  {'sql_condition': '"surname_dm_l" = "surname_dm_r"',
   'label_for_charts': 'Exact match surname_dm'},
  {'sql_condition': 'levenshtein("surname_l", "surname_r") <= 2',
   'label_for_charts': 'Levenshtein <= 2'},
  {'sql_condition': 'jaccard("surname_l", "surname_r") >= 1',
   'label_for_charts': 'Jaccard >= 1'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'Exact match vs. Names with phonetic exact match vs. Surname within levenshtein threshold 2 vs. Surname within jaccard threshold 1 vs. anything else'}
```

which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired.
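
As a rough sketch of what using the dictionary as a starting point could look like, the example below works on an abbreviated, hand-copied version of the surname specification and adds one extra level ahead of the `ELSE` level. The helper `insert_level_before_else` and the added `Levenshtein <= 3` level are hypothetical and chosen only for illustration. The resulting dictionary would then be supplied to Splink as described in the linked guide.

```python
import copy

# Abbreviated copy of the specification dictionary for the surname comparison.
surname_spec = {
    "output_column_name": "custom_surname_surname_dm",
    "comparison_levels": [
        {
            "sql_condition": '"surname_l" IS NULL OR "surname_r" IS NULL',
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": '"surname_l" = "surname_r"',
            "label_for_charts": "Exact match surname",
        },
        {
            "sql_condition": 'levenshtein("surname_l", "surname_r") <= 2',
            "label_for_charts": "Levenshtein <= 2",
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
}


def insert_level_before_else(spec, new_level):
    """Return a copy of spec with new_level placed just before the ELSE level."""
    customised = copy.deepcopy(spec)  # leave the original dictionary untouched
    levels = customised["comparison_levels"]
    else_index = next(
        index
        for index, level in enumerate(levels)
        if level["sql_condition"] == "ELSE"
    )
    levels.insert(else_index, new_level)
    return customised


custom_spec = insert_level_before_else(
    surname_spec,
    {
        # Hypothetical extra level, looser than the existing Levenshtein level.
        "sql_condition": 'levenshtein("surname_l", "surname_r") <= 3',
        "label_for_charts": "Levenshtein <= 3",
    },
)

for level in custom_spec["comparison_levels"]:
    print(level["label_for_charts"])
```

New levels go before `ELSE` because levels are evaluated in order, so anything placed after the catch-all level would never be reached.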

<hr>

## Postcode Comparisons

The [comparison_template_library](../comparison_template_library.md##splink.comparison_template_library) contains the [postcode_comparison](../comparison_template_library.md##splink.comparison_template_library.PostcodeComparisonBase) function which provides a sensible approach to comparing postcodes in terms of their constituent components, out-of-the-box. See [Feature Engineering](./feature_engineering.md) for more details.

```python
from splink.duckdb.duckdb_comparison_template_library import postcode_comparison

pc_comparison = postcode_comparison("postcode")
```

Gives a comparison structured as follows:

```
Comparison: Postcode
├─-- ComparisonLevel: Exact match
├─-- ComparisonLevel: Exact match on sector
├─-- ComparisonLevel: Exact match on district
├─-- ComparisonLevel: Exact match on area
├─-- ComparisonLevel: All other
```

Or, using `human_readable_description` to generate automatically from `pc_comparison`:

```python
print(pc_comparison.human_readable_description)
```

```
Comparison 'Exact match on full postcode vs. exact match on sector vs. exact match on district vs. exact match on area vs. all other comparisons' of "postcode".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
    - 'Exact match postcode' with SQL rule: "postcode_l" = "postcode_r"
    - 'Exact match Postcode Sector' with SQL rule: 
        regexp_extract("postcode_l", '^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9]')
     = 
        regexp_extract("postcode_r", '^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9]')

    - 'Exact match Postcode District' with SQL rule: 
        regexp_extract("postcode_l", '^[A-Z]{1,2}[0-9][A-Z0-9]?')
     = 
        regexp_extract("postcode_r", '^[A-Z]{1,2}[0-9][A-Z0-9]?')

    - 'Exact match Postcode Area' with SQL rule: 
        regexp_extract("postcode_l", '^[A-Z]{1,2}')
     = 
        regexp_extract("postcode_r", '^[A-Z]{1,2}')

    - 'All other comparisons' with SQL rule: ELSE
```



where individual postcode components are extracted under-the-hood using the `regex_extract` argument.

Note that the 'Exact match Postcode District' level also captures matches on subdistricts where they exist in the data.
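
The component extraction can be tried out in plain Python with the standard `re` module, using the same regular expressions that appear in the SQL rules. This is a sketch for exploring the patterns, not Splink's implementation: `extract_component` and `postcode_comparison_level` are hypothetical helpers, and returning an empty string when nothing matches is a choice made for this example.

```python
import re

# Patterns copied from the SQL rules, ordered from most to least specific.
COMPONENT_PATTERNS = [
    ("Exact match Postcode Sector", r"^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9]"),
    ("Exact match Postcode District", r"^[A-Z]{1,2}[0-9][A-Z0-9]?"),
    ("Exact match Postcode Area", r"^[A-Z]{1,2}"),
]


def extract_component(postcode: str, pattern: str) -> str:
    """Return the matched prefix, or an empty string if nothing matches."""
    match = re.search(pattern, postcode)
    return match.group(0) if match else ""


def postcode_comparison_level(left, right):
    """Return the label of the first level that matches (hypothetical helper)."""
    if left is None or right is None:
        return "Null"
    if left == right:
        return "Exact match postcode"
    for label, pattern in COMPONENT_PATTERNS:
        if extract_component(left, pattern) == extract_component(right, pattern):
            return label
    return "All other comparisons"


for label, pattern in COMPONENT_PATTERNS:
    print(label, "->", repr(extract_component("SW1A 2AA", pattern)))
print(postcode_comparison_level("SW1A 2AA", "SW1A 1AA"))
```

For `SW1A 2AA` the district pattern returns `SW1A`, which is why matches on subdistricts are captured by the district level.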

Performing comparisons based on substrings alone doesn't always give the best sense of whether two postcodes are close together, since locations which are geographically close can be in different postcode regions, e.g. London postcodes starting 'N' vs 'SW'. Given this, the [postcode_comparison](../comparison_template_library.md##splink.comparison_template_library.PostcodeComparisonBase) function also allows the user flexibility to include [cll.distance_in_km_level()](../comparison_level_library.md#splink.comparison_level_library.DistanceFunctionLevelBase) by supplying `lat_col`, `long_col` and `km_thresholds` arguments. This can help to improve results. (See [Feature Engineering](./feature_engineering.md) for more details.)
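
For intuition about distance-based levels, the sketch below computes a great-circle distance with the haversine formula and assigns it to the first threshold it falls within. It is an assumption-laden illustration: `haversine_km` and `distance_level` are hypothetical helpers, the level labels are made up for this example, and Splink's own distance calculation may differ.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used as an approximation


def haversine_km(lat1, long1, lat2, long2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = radians(long2 - long1)
    a = sin(delta_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(delta_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def distance_level(distance_km, km_thresholds=(1, 10, 50)):
    """Return a label for the smallest threshold the distance falls within."""
    for threshold in sorted(km_thresholds):
        if distance_km <= threshold:
            return f"Distance within {threshold} km"  # hypothetical label
    return "All other comparisons"


# One degree of latitude is roughly 111 km, so this falls outside all thresholds.
print(distance_level(haversine_km(51.0, 0.0, 52.0, 0.0)))
```

Thresholds are checked from smallest to largest, so each pair is assigned to the tightest distance band it satisfies.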

Users also have the option to set `invalid_postcodes_as_null` to `True`. If `True`, postcodes that do not adhere to a valid postcode format as determined by `valid_postcode_regex` will be included in the null level. `valid_postcode_regex` defaults to `"^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$"`.
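
The default `valid_postcode_regex` can be checked against sample values with Python's `re` module to see which postcodes would fall into the null level. This is only a sketch: `is_treated_as_null` is a hypothetical helper that mirrors the idea, not Splink's code.

```python
import re

# Default value of valid_postcode_regex quoted above.
VALID_POSTCODE_REGEX = r"^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$"


def is_treated_as_null(postcode) -> bool:
    """True if the value is missing or does not match the valid format."""
    if postcode is None:
        return True
    return re.search(VALID_POSTCODE_REGEX, postcode) is None


for value in ["SW1A 2AA", "sw1a 2aa", "SW1A2AA", None]:
    print(repr(value), is_treated_as_null(value))
```

The pattern expects upper-case letters and a single space before the final three characters, so lower-case or unspaced postcodes would be treated as invalid unless they are standardised beforehand.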

For example:

```python
# Treat invalid postcodes as null and add distance-based levels
# computed from the latitude and longitude columns.
pc_comparison = postcode_comparison(
    "postcode",
    invalid_postcodes_as_null=True,
    lat_col="lat",
    long_col="long",
    km_thresholds=[1, 10, 50]
)
print(pc_comparison.human_readable_description)
```

```
Comparison 'Exact match on full postcode vs. exact match on sector vs. exact match on district vs. exact match on area vs. Postcode within km_distance thresholds 1, 10, 50 vs. all other comparisons' of "postcode", "long" and "lat".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: 
        regexp_extract("postcode_l", '^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$')
     IS NULL OR 
        regexp_extract("postcode_r", '^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$')
     IS NULL OR

        regexp_extract("postcode_l", '^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$')
    =='' OR 
        regexp_extract("postcode_r", '^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$')
     ==''
    - 'Exact match postcode' with SQL rule: "postcode_l" = "postcode_r"
    - 'Exact match Postcode Sector' with SQL rule: 
        regexp_extract("postcode_l", '^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9]')
     = 
        regexp_extract("postcode_r", '^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9]')
```

To see this as a specifications dictionary, you can call:

```python
pc_comparison.as_dict()
```

```
{'output_column_name': 'postcode',
 'comparison_levels': [{'sql_condition': '\n        regexp_extract("postcode_l", \'^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$\')\n     IS NULL OR \n        regexp_extract("postcode_r", \'^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$\')\n     IS NULL OR\n                      \n        regexp_extract("postcode_l", \'^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$\')\n    ==\'\' OR \n        regexp_extract("postcode_r", \'^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$\')\n     ==\'\'',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': '"postcode_l" = "postcode_r"',
   'label_for_charts': 'Exact match postcode'},
  {'sql_condition': '\n        regexp_extract("postcode_l", \'^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9]\')\n     = \n        regexp_extract("postcode_r", \'^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9]\')\n    ',
   'label_for_charts': 'Exact match Postcode Sector'},
  {'sql_condition': '\n        regexp_extract("postcode_l", \'^[A-Z]{1,2}[0-9][A-Z0-9]?\')\n   
```

which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired.