# Choosing Comparison Levels

When building a Splink model, one of the most important steps is defining the [`Comparisons`](../comparison.md) and [`Comparison Levels`](../comparison_level.md) that the model will train on. Each `Comparison Level` within a `Comparison` should represent a different amount of evidence that two records are a match, to which the model can assign a Match Weight.

`Comparison`s follow the structure below:

```
Data Linking Model
├─-- Comparison: Date of birth
│    ├─-- ComparisonLevel: Exact match
│    ├─-- ComparisonLevel: Up to one character difference
│    ├─-- ComparisonLevel: Up to three character difference
│    ├─-- ComparisonLevel: All other
├─-- Comparison: Name
│    ├─-- ComparisonLevel: Exact match on first name and surname
│    ├─-- ComparisonLevel: Exact match on first name
│    ├─-- etc.
```



## How records get assigned to Comparison Levels

When assigning a record pair to a `Comparison Level`, Splink runs through the levels in order and assigns the pair to the first level whose condition it satisfies. As a result, when defining a `Comparison`, care must be taken over the order of the `Comparison Level`s.
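The first-match rule can be illustrated with a minimal pure-Python sketch. This is not Splink's implementation; the `levels` list and the `assign_level` function are hypothetical names used only to show the idea of walking an ordered list of conditions and stopping at the first one that holds.

```python
# Illustrative only: each level is a (label, condition) pair, checked in order.
levels = [
    ("Null", lambda left, right: left is None or right is None),
    ("Exact match", lambda left, right: left == right),
    ("All other comparisons", lambda left, right: True),  # catch-all
]


def assign_level(left, right, levels):
    """Return the label of the first level whose condition holds."""
    for label, condition in levels:
        if condition(left, right):
            return label
    raise ValueError("no level matched; add a catch-all level")


print(assign_level("Julia", None, levels))      # Null
print(assign_level("Julia", "Julia", levels))   # Exact match
print(assign_level("Julia", "Rachel", levels))  # All other comparisons
```

Because evaluation stops at the first satisfied condition, a later level only ever sees the pairs that every earlier level rejected.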

For example, we can construct two simple comparisons for a `name` column, both with an exact match level and a fuzzy match ([Levenshtein](./comparators.md#levenshtein-distance)) level, differing only in the order of those two levels:

```python
import splink.duckdb.comparison_level_library as cll

# Exact match is checked before the fuzzy match
comparison_correct_order = {
    "output_column_name": "name",
    "comparison_levels": [
        cll.null_level("name"),
        cll.exact_match_level("name"),
        cll.levenshtein_level("name", 2),
        cll.else_level(),
    ],
}

# Fuzzy match is checked before the exact match
comparison_incorrect_order = {
    "output_column_name": "name",
    "comparison_levels": [
        cll.null_level("name"),
        cll.levenshtein_level("name", 2),
        cll.exact_match_level("name"),
        cll.else_level(),
    ],
}
```

Now, consider a list of names. We can use the `get_comparison_levels` function from the `comparison_helpers` module to see how each pair of these names is assigned to a `Comparison Level` under each of our two `Comparison`s.

```python
import splink.comparison_helpers as ch

names = ["Julia", "Julia", "Julie", "Rachel"]

ch.get_comparison_levels(names, comparison_correct_order)
```

| comparison_level | gamma | name |
|---|---|---|
| Exact match | 2 | [[Julia, Julia]] |
| Levenshtein <= 2 | 1 | [[Julia, Julie], [Julia, Julie]] |
| All other comparisons | 0 | [[Julia, Rachel], [Julia, Rachel], [Julie, Rac... |
| Null | -1 | [] |


```python
ch.get_comparison_levels(names, comparison_incorrect_order)
```

| comparison_level | gamma | name |
|---|---|---|
| Levenshtein <= 2 | 2 | [[Julia, Julia], [Julia, Julie], [Julia, Julie]] |
| Exact match | 1 | [] |
| All other comparisons | 0 | [[Julia, Rachel], [Julia, Rachel], [Julie, Rac... |
| Null | -1 | [] |


So, when `Levenshtein <= 2` is the first level, exact matches are captured in that level rather than in the `Exact match` level, because two identical strings have a Levenshtein distance of 0. As a result, the `Exact match` level never has any record pairs assigned to it, so it is never trained within the Splink model, making it redundant.

In general, given the ordered nature of `Comparison Level`s, it is recommended to order levels starting with those which show the greatest similarity between features then working down to the least similar.
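One way to sanity-check an ordering is to count how many pairs land in each level and flag any level that receives none. The sketch below does this in plain Python without Splink. The `levenshtein` and `count_pairs_per_level` helpers and the `(label, condition)` tuples are hypothetical stand-ins written for this illustration, and the null level is left out because the sample names contain no missing values.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Illustrative stand-ins for the levels used above.
exact = ("Exact match", lambda left, right: left == right)
fuzzy = ("Levenshtein <= 2", lambda left, right: levenshtein(left, right) <= 2)
other = ("All other comparisons", lambda left, right: True)


def count_pairs_per_level(names, levels):
    """Assign every pair to its first satisfied level and count per level."""
    counts = {label: 0 for label, _ in levels}
    for left, right in combinations(names, 2):
        for label, condition in levels:
            if condition(left, right):
                counts[label] += 1
                break
    return counts


names = ["Julia", "Julia", "Julie", "Rachel"]
correct = count_pairs_per_level(names, [exact, fuzzy, other])
incorrect = count_pairs_per_level(names, [fuzzy, exact, other])

print(correct)    # {'Exact match': 1, 'Levenshtein <= 2': 2, 'All other comparisons': 3}
print(incorrect)  # {'Levenshtein <= 2': 3, 'Exact match': 0, 'All other comparisons': 3}

# Levels that never receive a pair are candidates for reordering or removal.
unreached = [label for label, n in incorrect.items() if n == 0]
print(unreached)  # ['Exact match']
```

An empty level on a small sample is only a hint, since the sample may simply lack such pairs, but a level that can never be reached regardless of the data (as with `Exact match` placed after `Levenshtein <= 2`) points to an ordering problem.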

Note that the `Comparison Level`s are indexed with `gamma` values, which are used behind the scenes in Splink. For example, you can see the gamma values for each `Comparison` when looking at the [Comparison Viewer Dashboard](../demos/06_Visualising_predictions.ipynb#comparison-viewer-dashboard).
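In the output tables above, the null level has `gamma = -1`, the catch-all level has `gamma = 0`, and the remaining levels count upwards so that the first-listed non-null level has the highest value. The sketch below reproduces that pattern. It is inferred from those tables only, not taken from Splink's source, and `gamma_values` is a hypothetical helper name.

```python
def gamma_values(level_labels, null_label="Null"):
    """Map ordered level labels to gamma indices following the pattern above."""
    non_null = [label for label in level_labels if label != null_label]
    # The first non-null level gets the highest gamma; the last one gets 0.
    gammas = {
        label: len(non_null) - 1 - position
        for position, label in enumerate(non_null)
    }
    if null_label in level_labels:
        gammas[null_label] = -1
    return gammas


ordered_levels = ["Null", "Exact match", "Levenshtein <= 2", "All other comparisons"]
print(gamma_values(ordered_levels))
# {'Exact match': 2, 'Levenshtein <= 2': 1, 'All other comparisons': 0, 'Null': -1}
```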