# Analysing blocking rules to optimise runtimes

For most large datasets, it will be computationally intractable to compare every row with every other row for the purpose of record linkage. We can use a technique called blocking to dramatically reduce the number of comparisons by comparing only records that adhere to certain rules, such as that the first name and date of birth must be equal.
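
To get a feel for why comparing every row with every other row becomes intractable, the sketch below counts the unordered pairs in a single table that is being deduplicated. The function name `n_pairwise_comparisons` is purely illustrative and is not part of Splink.

```python
def n_pairwise_comparisons(n_rows: int) -> int:
    """Number of unordered record pairs in a single table of n_rows rows."""
    return n_rows * (n_rows - 1) // 2


# The number of comparisons grows quadratically with the number of rows
for n in (1_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} rows -> {n_pairwise_comparisons(n):,} comparisons")
```

A table of one million rows already implies roughly 500 billion comparisons, which is why only a small subset of pairs can realistically be scored.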

Blocking is the first step of a two-step process to link data:

Step 1:  Use blocking rules to generate candidate pairwise record comparisons

Step 2:  Use a probabilistic linkage model to score these candidate pairs, to determine which ones should be linked

In Splink, blocking rules are specified as SQL expressions. For example, to generate the subset of record comparisons where the first name matches, we can specify the following blocking rule:

`l.first_name = r.first_name`

Since blocking rules are SQL expressions, they can be arbitrarily complex. An example is as follows:

`substr(l.first_name, 1,1) = substr(r.first_name, 1,1) and l.surname = r.surname`
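
As a rough illustration of what a blocking rule selects, the sketch below uses Python's built-in `sqlite3` module to self-join a tiny made-up table, with the blocking rule as the join condition. This is not how Splink is implemented internally; the table name `people`, the `unique_id` column, the sample rows and the helper `candidate_pairs` are all assumptions made for the example.

```python
import sqlite3

# Tiny made-up dataset: (unique_id, first_name, surname)
rows = [
    (1, "john", "smith"),
    (2, "jon", "smith"),
    (3, "john", "smyth"),
    (4, "mary", "jones"),
    (5, "jane", "smith"),
]

con = sqlite3.connect(":memory:")
con.execute("create table people (unique_id integer, first_name text, surname text)")
con.executemany("insert into people values (?, ?, ?)", rows)


def candidate_pairs(blocking_rule: str) -> list[tuple[int, int]]:
    """Return the record pairs that satisfy the given SQL blocking rule."""
    # l.unique_id < r.unique_id keeps each unordered pair exactly once
    sql = f"""
        select l.unique_id, r.unique_id
        from people as l
        inner join people as r
        on ({blocking_rule}) and l.unique_id < r.unique_id
        order by l.unique_id, r.unique_id
    """
    return con.execute(sql).fetchall()


simple_pairs = candidate_pairs("l.first_name = r.first_name")
complex_pairs = candidate_pairs(
    "substr(l.first_name, 1,1) = substr(r.first_name, 1,1) and l.surname = r.surname"
)

print(simple_pairs)   # [(1, 3)]
print(complex_pairs)  # [(1, 2), (1, 5), (2, 5)]
```

Out of the 10 possible pairs in this five-row table, the first rule keeps one and the second keeps three.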



## Devising effective blocking rules 

https://github.com/moj-analytical-services/splink_demos/blob/d391cd32128cd0d51e4a1275ea7d833f2c043462/deduplication_detailed_example.ipynb

Your blocking rules have two aims:

1. Eliminate enough non-matching comparison pairs for your record linkage problem to be computationally tractable.
2. Retain all true matching pairs (or as close to all as possible).

It is usually impossible to find a single blocking rule which achieves both aims, so we recommend using multiple blocking rules.  

For example, consider the following blocking rule:

`l.first_name = r.first_name and l.dob = r.dob`

This rule is likely to be effective in reducing the number of comparison pairs.  It will also retain many true matching pairs, including those with errors on other columns such as surname.

However, it will eliminate any true matches where there are typos, nulls or other errors in the `first_name` or `dob` fields.

Now consider a second blocking rule:

`l.email = r.email`

This rule is also likely to be effective in reducing the number of comparison pairs, and will retain many true matching pairs.  However, it will eliminate true matches where there are typos or nulls in the `email` column.

Neither one of these blocking rules achieves the two aims set out above.

But between them, they might do quite a good job.  

When we specify multiple blocking rules, Splink will generate all comparison pairs that meet any one of the rules.

This means that, for a true match to be eliminated by these two blocking rules, it would need an error in the email _and_ an error in either the first name or the date of birth.

This is not completely implausible, but it is significantly less likely than if we'd just used a single rule.
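
The effect of combining rules can be sketched with a small made-up table and Python's built-in `sqlite3` module. The rows, the `people` table, the `unique_id` column and the helper `candidate_pairs` are hypothetical and only serve to illustrate the idea; Splink's own implementation may differ. In this toy data, records 1, 2 and 3 are intended to be the same person, as are records 4 and 5.

```python
import sqlite3

# Made-up rows: (unique_id, first_name, dob, email)
rows = [
    (1, "john", "1990-01-01", "john@example.com"),
    (2, "jon", "1990-01-01", "john@example.com"),  # typo in first_name
    (3, "john", "1990-01-01", None),               # missing email
    (4, "mary", "1985-05-05", "mary@example.com"),
    (5, "mary", "1985-05-05", "mry@example.com"),  # typo in email
]

con = sqlite3.connect(":memory:")
con.execute(
    "create table people (unique_id integer, first_name text, dob text, email text)"
)
con.executemany("insert into people values (?, ?, ?, ?)", rows)


def candidate_pairs(blocking_rule: str) -> list[tuple[int, int]]:
    """Return the record pairs that satisfy the given SQL blocking rule."""
    sql = f"""
        select l.unique_id, r.unique_id
        from people as l
        inner join people as r
        on ({blocking_rule}) and l.unique_id < r.unique_id
        order by l.unique_id, r.unique_id
    """
    return con.execute(sql).fetchall()


pairs_name_dob = candidate_pairs("l.first_name = r.first_name and l.dob = r.dob")
pairs_email = candidate_pairs("l.email = r.email")

# A pair is kept if it meets any one of the rules
combined = sorted(set(pairs_name_dob) | set(pairs_email))

print(pairs_name_dob)  # [(1, 3), (4, 5)]
print(pairs_email)     # [(1, 2)]
print(combined)        # [(1, 2), (1, 3), (4, 5)]
```

Each rule on its own misses at least one true match, but together they recover three of the four. The remaining true match, the pair (2, 3), has a typo in the first name on one record and a null email on the other, so it fails both rules.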

More generally, we can specify a longer list of blocking rules such that it becomes almost completely implausible that a true match would fail to meet at least one of these blocking criteria.
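
A back-of-the-envelope sketch of this reasoning: if each rule misses a given true match with some probability, and we assume (optimistically) that the rules fail independently of one another, then the chance that every rule misses the match is the product of the individual miss rates. The rates below are invented purely for illustration and are not measurements from any dataset.

```python
from math import prod


def chance_all_rules_miss(miss_rates: list[float]) -> float:
    """Probability that a true match fails every rule, assuming independent failures."""
    return prod(miss_rates)


# Hypothetical per-rule miss rates, for illustration only
illustrative_miss_rates = [0.10, 0.05, 0.20]

print(chance_all_rules_miss(illustrative_miss_rates[:1]))  # one rule
print(chance_all_rules_miss(illustrative_miss_rates))      # all three rules
```

With these made-up rates, one rule misses about 1 in 10 true matches, while all three together miss roughly 1 in 1,000. In real data, errors are often correlated across columns, so the true figure is likely to be somewhat higher than this simple product suggests.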

