# Out-of-the-box Comparisons for specific data types

Similarity is defined differently for types of data (e.g. names, dates of birth, postcodes, addresses, ids). The [Comparison Template Library](customising_comparisons.ipynb#method-2-using-the-comparisontemplatelibrary) contains functions to generate ready-made comparisons for a variety of data types.

Below are examples of how to structure comparisons for a variety of data types.

<hr>

## DateComparison

The date comparison is structured as: 

- Null level  
- Exact match  
- Fuzzy match (Damerau-Levenshtein distance of 1 character)
- Date interval match (within X days/months/years)  
- Else level


In [6]:
import splink.comparison_template_library as ctl

date_of_birth_comparison = ctl.DateComparison(
    "date_of_birth",
    datetime_metrics=["month", "year", "year"],
    datetime_thresholds=[1, 1, 10],
    input_is_string=True,
)

Gives a comparison structured as follows:

```
Comparison: Date of birth
├─-- ComparisonLevel: Exact match
├─-- ComparisonLevel: Up to one character difference
├─-- ComparisonLevel: Dates within 1 month of each other
├─-- ComparisonLevel: Dates within 1 year of each other
├─-- ComparisonLevel: Dates within 10 years of each other
├─-- ComparisonLevel: All other
```

Or, using `human_readable_description` to generate automatically from `date_of_birth_comparison`:

In [7]:
print(date_of_birth_comparison.get_comparison("duckdb").human_readable_description)

Comparison 'Exact match vs. Damerau-Levenshtein distance <= 1 vs. month difference <= 1 vs. year difference <= 1 vs. year difference <= 10 vs. anything else' of "date_of_birth".
Similarity is assessed using the following ComparisonLevels:
    - 'date_of_birth is NULL' with SQL rule: "date_of_birth_l" IS NULL OR "date_of_birth_r" IS NULL
    - 'Exact match on date_of_birth' with SQL rule: "date_of_birth_l" = "date_of_birth_r"
    - 'Damerau-Levenshtein distance of date_of_birth <= 1' with SQL rule: damerau_levenshtein("date_of_birth_l", "date_of_birth_r") <= 1
    - 'Abs difference of 'transformed date_of_birth <= 1 month'' with SQL rule: ABS(EPOCH(try_strptime("date_of_birth_l", '%Y-%m-%d')) - EPOCH(try_strptime("date_of_birth_r", '%Y-%m-%d'))) <= 2629800.0
    - 'Abs difference of 'transformed date_of_birth <= 1 year'' with SQL rule: ABS(EPOCH(try_strptime("date_of_birth_l", '%Y-%m-%d')) - EPOCH(try_strptime("date_of_birth_r", '%Y-%m-%d'))) <= 31557600.0
    - 'Abs difference of 'tran

To see this as a specifications dictionary you can call

In [8]:
date_of_birth_comparison.get_comparison("duckdb").as_dict()

{'output_column_name': 'date_of_birth',
 'comparison_levels': [{'sql_condition': '"date_of_birth_l" IS NULL OR "date_of_birth_r" IS NULL',
   'label_for_charts': 'date_of_birth is NULL',
   'is_null_level': True},
  {'sql_condition': '"date_of_birth_l" = "date_of_birth_r"',
   'label_for_charts': 'Exact match on date_of_birth'},
  {'sql_condition': 'damerau_levenshtein("date_of_birth_l", "date_of_birth_r") <= 1',
   'label_for_charts': 'Damerau-Levenshtein distance of date_of_birth <= 1'},
  {'sql_condition': 'ABS(EPOCH(try_strptime("date_of_birth_l", \'%Y-%m-%d\')) - EPOCH(try_strptime("date_of_birth_r", \'%Y-%m-%d\'))) <= 2629800.0',
   'label_for_charts': "Abs difference of 'transformed date_of_birth <= 1 month'"},
  {'sql_condition': 'ABS(EPOCH(try_strptime("date_of_birth_l", \'%Y-%m-%d\')) - EPOCH(try_strptime("date_of_birth_r", \'%Y-%m-%d\'))) <= 31557600.0',
   'label_for_charts': "Abs difference of 'transformed date_of_birth <= 1 year'"},
  {'sql_condition': 'ABS(EPOCH(try_strp

which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired.

<hr>

## Name Comparisons

Name comparisons for an individual name column (e.g. forename, surname) are generally structured as: 

- Null level  
- Exact match  
- Fuzzy match ([using metric of choice](comparators.md))  
- Else level

The [comparison_template_library](../../comparison_template_library.md##splink.comparison_template_library) contains the [name_comparison](../../comparison_template_library.md#splink.comparison_template_library.NameComparisonBase) function which gives this structure, with some pre-defined parameters, out-of-the-box.

In [11]:
import splink.comparison_template_library as ctl

first_name_comparison = ctl.NameComparison("first_name")

Gives a comparison structured as follows:

```
Comparison: First Name
├─-- ComparisonLevel: Exact match
├─-- ComparisonLevel: First Names with Jaro-Winkler similarity of 0.9 or greater 
├─-- ComparisonLevel: First Names with Jaro-Winkler similarity of 0.8 or greater
├─-- ComparisonLevel: All other
```

Or, using `human_readable_description` to generate automatically from `first_name_comparison`:

In [12]:
print(first_name_comparison.get_comparison("duckdb").human_readable_description)

Comparison 'jaro_winkler at thresholds 0.9, 0.8 vs. anything else' of "first_name".
Similarity is assessed using the following ComparisonLevels:
    - 'first_name is NULL' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
    - 'Exact match on first_name' with SQL rule: "first_name_l" = "first_name_r"
    - 'Jaro-Winkler distance of first_name >= 0.9' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9
    - 'Jaro-Winkler distance of first_name >= 0.8' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.8
    - 'All other comparisons' with SQL rule: ELSE



The [name_comparison](../../comparison_template_library.md#splink.comparison_template_library.NameComparisonBase) function also allows flexibility to change the parameters and/or fuzzy matching comparison levels.

For example:

In [17]:
# TODO: What do we want to say here?

surname_comparison = ctl.NameComparison(
    "surname",
    phonetic_col_name="surname_dm",
    fuzzy_metric="levenshtein",
    fuzzy_thresholds=[4,3,2]
)
print(surname_comparison.get_comparison("duckdb").human_readable_description)

Comparison 'levenshtein at thresholds 4, 3, 2 vs. anything else' of "surname" and "surname_dm".
Similarity is assessed using the following ComparisonLevels:
    - 'surname is NULL' with SQL rule: "surname_l" IS NULL OR "surname_r" IS NULL
    - 'Exact match on surname' with SQL rule: "surname_l" = "surname_r"
    - 'Exact match on surname_dm' with SQL rule: "surname_dm_l" = "surname_dm_r"
    - 'Levenshtein distance of surname <= 4' with SQL rule: levenshtein("surname_l", "surname_r") <= 4
    - 'Levenshtein distance of surname <= 3' with SQL rule: levenshtein("surname_l", "surname_r") <= 3
    - 'Levenshtein distance of surname <= 2' with SQL rule: levenshtein("surname_l", "surname_r") <= 2
    - 'All other comparisons' with SQL rule: ELSE



Where `surname_dm` refers to a column which has used the DoubleMetaphone algorithm on `surname` to give a phonetic spelling. This helps to catch names which sounds the same but have different spellings (e.g. Stephens vs Stevens). For more on Phonetic Transformations, see the [topic guide](phonetic.md).

To see this as a specifications dictionary you can call

In [8]:
surname_comparison.as_dict()

{'output_column_name': 'custom_surname_surname_dm',
 'comparison_levels': [{'sql_condition': '"surname_l" IS NULL OR "surname_r" IS NULL',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': '"surname_l" = "surname_r"',
   'label_for_charts': 'Exact match surname',
   'tf_adjustment_column': 'surname',
   'tf_adjustment_weight': 1.0},
  {'sql_condition': '"surname_dm_l" = "surname_dm_r"',
   'label_for_charts': 'Exact match surname_dm',
   'tf_adjustment_column': 'surname_dm',
   'tf_adjustment_weight': 1.0},
  {'sql_condition': 'levenshtein("surname_l", "surname_r") <= 2',
   'label_for_charts': 'Levenshtein <= 2'},
  {'sql_condition': 'jaccard("surname_l", "surname_r") >= 1',
   'label_for_charts': 'Jaccard >= 1'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'Exact match vs. Names with phonetic exact match vs. Surname within levenshtein threshold 2 vs. Surname within jaccard threshold 1 vs. anything e

which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired.

<hr>

## Forename and Surname Comparisons

It can be helpful to construct a single comparison for for comparing the forename and surname of two records as:

1. The Fellegi-Sunter model **assumes that comparisons are independent**. We know that forename and surname are usually correlated given the regional variation of names etc, so considering then in a single comparison can help to create better models.

2. **Term-frequencies** of individual forename and surname individually does not necessarily reflect how common the combination of forename and surname are. 
For example, in the UK population “Mohammed Khan” is a relatively common full name despite neither name occurring frequently. For more information on term-frequencies, see the dedicated [topic guide](term-frequency.md).
Addressing forename and surname in a single comparison can allows the model to consider the joint term-frequency as well as individual.

3. It is common for some records to have **swapped forename and surname by mistake**. Addressing forename and surname in a single comparison can allows the model to consider these name inversions.


Forename and Surname comparisons for an individual name column (e.g. forename, surname) are generally structured as: 

- Null level  
- Exact match Forename and Surname
- Exact match Forename and Surname swapped
- Exact match Surname
- Exact match Forename
- Fuzzy match Surname ([using metric of choice](comparators.md))
- Fuzzy match Forename ([using metric of choice](comparators.md))
- Else level

The [comparison_template_library](#method-2-using-the-comparisontemplatelibrary) contains the [forename_surname_comparison](../../comparison_template_library.md#splink.comparison_template_library.NameComparisonBase) function which gives this structure, with some pre-defined parameters, out-of-the-box.

In [22]:
import splink.comparison_template_library as ctl

full_name_comparison = ctl.ForenameSurnameComparison("forename", "surname")

Gives a comparison structured as follows:

```
Comparison: First Name
├─-- ComparisonLevel: Exact match Forename and Surname
├─-- ComparisonLevel: Exact match Forename and Surname swapped
├─-- ComparisonLevel: Exact match Surname
├─-- ComparisonLevel: Exact match Forename
├─-- ComparisonLevel: Surnames with Jaro-Winkler similarity greater than 0.88
├─-- ComparisonLevel: Forenames with Jaro-Winkler similarity greater than 0.88
├─-- ComparisonLevel: All other
```

Or, using `human_readable_description` to generate automatically from `first_name_comparison`:

In [19]:
print(full_name_comparison.get_comparison("duckdb").human_readable_description)

Comparison 'jaro_winkler forename at threshold 0.88 vs. anything else' of "surname" and "forename".
Similarity is assessed using the following ComparisonLevels:
    - '(forename is NULL) AND (surname is NULL)' with SQL rule: ("forename_l" IS NULL OR "forename_r" IS NULL) AND ("surname_l" IS NULL OR "surname_r" IS NULL)
    - '(Exact match on forename) AND (Exact match on surname)' with SQL rule: ("forename_l" = "forename_r") AND ("surname_l" = "surname_r")
    - 'Match on reversed cols: forename and surname' with SQL rule: "forename_l" = "surname_r" AND "forename_r" = "surname_l"
    - 'Exact match on surname' with SQL rule: "surname_l" = "surname_r"
    - 'Exact match on forename' with SQL rule: "forename_l" = "forename_r"
    - 'Jaro-Winkler distance of surname >= 0.88' with SQL rule: jaro_winkler_similarity("surname_l", "surname_r") >= 0.88
    - 'Jaro-Winkler distance of forename >= 0.88' with SQL rule: jaro_winkler_similarity("forename_l", "forename_r") >= 0.88
    - 'All other co

The [forename_surname_comparison](../../comparison_template_library.md#splink.comparison_template_library.ForenameSurnameComparisonBase) function also allows flexibility to change the parameters and/or fuzzy matching comparison levels.

For example:

In [20]:
# TODO:  Update following final decision on what this should look like:

# full_name_comparison = forename_surname_comparison(
#     "forename",
#     "surname",
#     term_frequency_adjustments=True,
#     tf_adjustment_col_forename_and_surname="full_name",
#     phonetic_forename_col_name="forename_dm",
#     phonetic_surname_col_name="surname_dm",
#     levenshtein_thresholds=[2],
#     jaro_winkler_thresholds=[],
#     jaccard_thresholds=[1],
# )
# print(full_name_comparison.human_readable_description)

Where:

- `forename_dm` and `surname_dm` refer to columns which have used the DoubleMetaphone algorithm on `forename` and `surname` to give a phonetic spelling. This helps to catch names which sounds the same but have different spellings (e.g. Stephens vs Stevens). For more on Phonetic Transformations, see the [topic guide](phonetic.md). These columns will have to already exist in the dataset, or be created in the [feature engineering](../data_preparation/feature_engineering.md#phonetic-transformations) stage when preparing datasets for linking.

- `full_name` is a column containing `forename` and `surname` so that the model can consider the term-frequency of the full name, as well as `forename` and `surname` individually. These columns will have to already exist in the dataset, or be created in the [feature engineering](../data_preparation/feature_engineering.md#full-name) stage when preparing datasets for linking.

To see this as a specifications dictionary you can call

In [23]:
full_name_comparison.get_comparison("duckdb").as_dict()

{'output_column_name': 'forename_surname',
 'comparison_levels': [{'sql_condition': '("forename_l" IS NULL OR "forename_r" IS NULL) AND ("surname_l" IS NULL OR "surname_r" IS NULL)',
   'label_for_charts': '(forename is NULL) AND (surname is NULL)',
   'is_null_level': True},
  {'sql_condition': '("forename_l" = "forename_r") AND ("surname_l" = "surname_r")',
   'label_for_charts': '(Exact match on forename) AND (Exact match on surname)'},
  {'sql_condition': '"forename_l" = "surname_r" AND "forename_r" = "surname_l"',
   'label_for_charts': 'Match on reversed cols: forename and surname'},
  {'sql_condition': '"surname_l" = "surname_r"',
   'label_for_charts': 'Exact match on surname'},
  {'sql_condition': '"forename_l" = "forename_r"',
   'label_for_charts': 'Exact match on forename'},
  {'sql_condition': 'jaro_winkler_similarity("surname_l", "surname_r") >= 0.88',
   'label_for_charts': 'Jaro-Winkler distance of surname >= 0.88'},
  {'sql_condition': 'jaro_winkler_similarity("forenam

Which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired.

<hr>

## Postcode Comparisons

The [comparison_template_library](../../comparison_template_library.md##splink.comparison_template_library) contains the [postcode_comparison](../../comparison_template_library.md##splink.comparison_template_library.PostcodeComparisonBase) function which provides a sensible approach to comparing postcodes in terms of their constituent components, out-of-the-box. See [Feature Engineering](../data_preparation/feature_engineering.md) for more details.

In [28]:
import splink.comparison_template_library as ctl
# TODO:Decide what this looks like
pc_comparison = ctl.PostcodeComparison("postcode", km_thresholds=[0.1, 0.2], lat_col="lat", long_col="lon")

Gives a comparison structured as follows:

```
Comparison: Postcode
├─-- ComparisonLevel: Exact match
├─-- ComparisonLevel: Exact match on sector
├─-- ComparisonLevel: Exact match on district
├─-- ComparisonLevel: Exact match on area
├─-- ComparisonLevel: All other
```

Or, using `human_readable_description` to generate automatically from `pc_comparison`:

In [29]:
print(pc_comparison.get_comparison("duckdb").human_readable_description)

Comparison 'km distance within thresholds 0.1, 0.2 vs. anything else' of "postcode", "lon" and "lat".
Similarity is assessed using the following ComparisonLevels:
    - 'postcode is NULL' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
    - 'Exact match on postcode' with SQL rule: "postcode_l" = "postcode_r"
    - 'Exact match on transformed postcode' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')
    - 'Exact match on transformed postcode' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '')
    - 'Exact match on transformed postcode' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}', 0), '')
    - 'Distance less than 0.1km' with SQL 

where individual postcode components are extracted under-the-hood using the `regex_extract` argument.

Note that the 'Exact match Postcode District' level also captures matches on subdistricts where they exist in the data.

Performing comparisons based on substrings alone doesn't always give the best sense of whether two postcodes are close together since locations which are geographically close can be in different postcode regions e.g. London postcodes starting 'N' vs 'SW'. Given this, the [postcode_comparison](../../comparison_template_library.md##splink.comparison_template_library.PostcodeComparisonBase) function also allows the user flexibility to include [cll.distance_in_km_level()](../../comparison_level_library.md#splink.comparison_level_library.DistanceFunctionLevelBase) by supplying `lat_col`, `long_col` and `km_thresholds` arguments. This can help to improve results. (See [Feature Enginnering](../data_preparation/feature_engineering.md) for more details.)

Users also have the option to set `invalid_postcodes_as_null` to `True`. If `True`, postcodes that do not adhere to a valid postcode format as determined by `valid_postcode_regex` will be included in the null level. `valid_postcode_regex` defaults to `"^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$"`.

For example:

In [15]:
# TODO: What does this look like?
# pc_comparison = postcode_comparison(
#     "postcode",
#     invalid_postcodes_as_null=True,
#     lat_col="lat",
#     long_col="long",
#     km_thresholds=[1, 10, 50]
# )
# print(pc_comparison.human_readable_description)

Comparison 'Exact match on full postcode vs. exact match on sector vs. exact match on district vs. exact match on area vs. Postcode within km_distance thresholds 1, 10, 50 vs. all other comparisons' of "postcode", "long" and "lat".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: 
        regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$')
     IS NULL OR 
        regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$')
     IS NULL OR
                      
        regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$')
    =='' OR 
        regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$')
     ==''
    - 'Exact match postcode' with SQL rule: lower("postcode_l") = lower("postcode_r")
    - 'Exact match Postcode Sector' with SQL rule: 
        regexp_extract(lower("postcode_l"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]')
     = 
        

To see this as a specifications dictionary you can call

In [30]:
pc_comparison.get_comparison("duckdb").as_dict()

{'output_column_name': 'postcode',
 'comparison_levels': [{'sql_condition': '"postcode_l" IS NULL OR "postcode_r" IS NULL',
   'label_for_charts': 'postcode is NULL',
   'is_null_level': True},
  {'sql_condition': '"postcode_l" = "postcode_r"',
   'label_for_charts': 'Exact match on postcode'},
  {'sql_condition': 'NULLIF(regexp_extract("postcode_l", \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]\', 0), \'\') = NULLIF(regexp_extract("postcode_r", \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]\', 0), \'\')',
   'label_for_charts': 'Exact match on transformed postcode'},
  {'sql_condition': 'NULLIF(regexp_extract("postcode_l", \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?\', 0), \'\') = NULLIF(regexp_extract("postcode_r", \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?\', 0), \'\')',
   'label_for_charts': 'Exact match on transformed postcode'},
  {'sql_condition': 'NULLIF(regexp_extract("postcode_l", \'^[A-Za-z]{1,2}\', 0), \'\') = NULLIF(regexp_extract("postcode_r", \'^[A-Za-z]{1,2}\', 0), \'\')',
   'label_for_charts': 'Exa

which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired.

<hr>

## Email Comparisons

Email comparisons are generally structured as:

* Null Level
* Exact match on email address
* Exact match on username
* Fuzzy match on email address
* Fuzzy match on username
* All other comparisons

The [comparison_template_library](../../comparison_template_library.md##splink.comparison_template_library) contains the [email_comparison](../../comparison_template_library.md##splink.comparison_template_library.EmailComparisonBase) function which provides a sensible approach to comparing emails out-of-the-box.

In [33]:
import splink.comparison_template_library as ctl

standard_email_comparison = ctl.EmailComparison("email")

Gives a comparison structured as follows:

```
Comparison: Email
├─-- ComparisonLevel: Exact match
├─-- ComparisonLevel: Exact match on username with different domain
├─-- ComparisonLevel: Fuzzy match on email using Jaro-Winkler
├─-- ComparisonLevel: Fuzzy match on username using Jaro-Winkler
├─-- ComparisonLevel: All other comparisons
```

Or, using `human_readable_description` to generate automatically from `email_comparison`:

In [34]:
print(standard_email_comparison.get_comparison("duckdb").human_readable_description)

Comparison 'jaro_winkler on username at threshold 0.88 vs. anything else' of "email".
Similarity is assessed using the following ComparisonLevels:
    - 'email is NULL' with SQL rule: "email_l" IS NULL OR "email_r" IS NULL
    - 'Exact match on email' with SQL rule: "email_l" = "email_r"
    - 'Exact match on transformed email' with SQL rule: NULLIF(regexp_extract("email_l", '^[^@]+', 0), '') = NULLIF(regexp_extract("email_r", '^[^@]+', 0), '')
    - 'Jaro-Winkler distance of email >= 0.88' with SQL rule: jaro_winkler_similarity("email_l", "email_r") >= 0.88
    - 'Jaro-Winkler distance of transformed email >= 0.88' with SQL rule: jaro_winkler_similarity(NULLIF(regexp_extract("email_l", '^[^@]+', 0), ''), NULLIF(regexp_extract("email_r", '^[^@]+', 0), '')) >= 0.88
    - 'All other comparisons' with SQL rule: ELSE



where individual email components are extracted under-the-hood using the `regex_extract` argument.

By default, the fuzzy matching is done using Jaro-Winkler thresholds. This will bias the start of a string, specifically the first four characters, which may not be appropriate for all emails. The `email_comparison` function is flexible and allows a number of other string fuzzy matching functions.

Users also have the option to set `invalid_emails_as_null` to `True`. If `True`, postcodes that do not adhere to a valid email format as determined by `valid_email_regex` will be included in the null level. `valid_email_regex` defaults to `"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$"`.

For example:

In [19]:
# TODO: What does this look like?

# bespoke_email_comparison = email_comparison(
#     "email",
#     jaro_winkler_thresholds=[],
#     jaro_thresholds=[0.8],
#     include_username_match_level=False,
#     include_domain_match_level=True,
#     invalid_emails_as_null=True,
# )
# print(bespoke_email_comparison.human_readable_description)

Comparison 'Exact match vs. Domain-only match vs.anything else' of "email".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: 
        regexp_extract("email_l", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')
     IS NULL OR 
        regexp_extract("email_r", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')
     IS NULL OR
                      
        regexp_extract("email_l", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')
    =='' OR 
        regexp_extract("email_r", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')
     ==''
    - 'Exact match email' with SQL rule: "email_l" = "email_r"
    - 'Jaro_similarity email >= 0.8' with SQL rule: jaro_similarity("email_l", "email_r") >= 0.8
    - 'Jaro_similarity email >= 0.8' with SQL rule: jaro_similarity("email_l", "email_r") >= 0.8
    - 'Exact match Email Domain' with SQL rule: 
        regexp_extract("email_l", '@([^@]+)$')
     = 
        regexp_extract("email_r", '@([^@]+)$')

To see this as a specifications dictionary you can call

In [20]:
# bespoke_email_comparison.as_dict()

{'output_column_name': 'email',
 'comparison_levels': [{'sql_condition': '\n        regexp_extract("email_l", \'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\')\n     IS NULL OR \n        regexp_extract("email_r", \'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\')\n     IS NULL OR\n                      \n        regexp_extract("email_l", \'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\')\n    ==\'\' OR \n        regexp_extract("email_r", \'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\')\n     ==\'\'',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': '"email_l" = "email_r"',
   'label_for_charts': 'Exact match email'},
  {'sql_condition': 'jaro_similarity("email_l", "email_r") >= 0.8',
   'label_for_charts': 'Jaro_similarity email >= 0.8'},
  {'sql_condition': 'jaro_similarity("email_l", "email_r") >= 0.8',
   'label_for_charts': 'Jaro_similarity email >= 0.8'},
  {'sql_condition': '\n        regexp_extract("email_l", \'@([^@]+)$\')\n     =

which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired.