
Finds blocking rules which return a comparison count below a given threshold #1665

Merged

Conversation

@RobinL (Member) commented Oct 24, 2023

This is the first PR in a series which will establish two new functions.

linker._detect_blocking_rules_for_prediction(
    max_comparisons_per_rule=1e3, min_freedom=1, 
)
linker._detect_blocking_rules_for_em_training(
    max_comparisons_per_rule=1e2, min_freedom=0
)

We have decided to leave these functions private (prefixed with _) for now so we can test them out before introducing them to the public API.

See here for further description.

This PR adds the first ingredient: the ability to quickly and efficiently find a list of blocking rules that returns a comparison count below a certain threshold.

Here's my current strategy.

This is motivated by exploiting the new linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions(br) function, which is super fast and so can evaluate large numbers of possible blocking rules quickly:

Let's say we have columns c1, c2, c3, etc.

We recurse through a tree running functions like:

c1                    count_comparisons(c1)           
├── c2                count_comparisons(c1, c2)
│   └── c3            count_comparisons(c1, c2, c3)     
├── c3                count_comparisons(c1, c3)
│   └── c2            count_comparisons(c1, c3, c2)  
c2                    count_comparisons(c2)          
├── c1                count_comparisons(c2, c1)       
│   └── c3            count_comparisons(c2, c1, c3)
├── c3                count_comparisons(c2, c3)
│   └── c1            count_comparisons(c2, c3, c1)
etc.

It skips running any count that has already been generated with the same columns in a different order.

Once the count is below the threshold, no branches from the node are explored.

So for example, if count_comparisons(c1,c2) < threshold_max_rows, it will skip all children such as count_comparisons(c1,c2,c3). In practice, this means only a small subset of the full tree is explored, making things a lot faster.
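
For illustration, here is a minimal sketch of that pruned search. This is not the code in this PR: `count_comparisons` stands in for `linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions`, and the traversal details and names are illustrative only.

```python
# Sketch only: a pruned depth-first search over column combinations.
def find_rules_below_threshold(columns, count_comparisons, threshold):
    results = []      # (columns_in_rule, comparison_count) pairs
    visited = set()   # frozensets of columns already counted

    def explore(current_combination):
        key = frozenset(current_combination)
        if key in visited:
            # Same columns in a different order: already counted, so skip.
            return
        visited.add(key)

        comparison_count = count_comparisons(current_combination)
        if comparison_count < threshold:
            # Below threshold: record the rule and prune the whole subtree,
            # since adding further columns only reduces the count.
            results.append((tuple(current_combination), comparison_count))
            return

        # Otherwise extend the rule with each column not yet used.
        for col in columns:
            if col not in current_combination:
                explore(current_combination + [col])

    for col in columns:
        explore([col])
    return results
```

Tracking visited combinations as frozensets is what makes re-ordered duplicates cheap to skip, and returning as soon as a combination falls below the threshold is what prunes the rest of that subtree.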

Moving to the real outputs now, with a threshold of 1m this results in a table like:

| blocking_rules | comparison_count | complexity |
| --- | --- | --- |
| first_name, surname | 533,375 | 2 |
| first_name, dob | 174,651 | 2 |
| first_name, birth_place | 344,971 | 2 |
| first_name, postcode_fake | 149,105 | 2 |
| first_name, gender, surname | 373,354 | 3 |
| etc. | | |
(Note the actual table has one-hot encoding, like this:)
| | blocking_rules | comparison_count | complexity | `__fixed__first_name` | `__fixed__surname` | `__fixed__dob` | `__fixed__birth_place` | `__fixed__postcode_fake` | `__fixed__gender` | `__fixed__occupation` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | first_name, surname | 533,375 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | first_name, dob | 174,651 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | first_name, birth_place | 344,971 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | first_name, postcode_fake | 149,105 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | first_name, gender, surname | 373,354 | 3 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
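
For illustration, each such row could be built along these lines. This is a hedged sketch rather than the PR's actual `_generate_output_combinations_table_row` helper (quoted later in this review), whose signature and body may differ.

```python
# Hypothetical sketch of building one output row with one-hot encoded columns.
def generate_output_row(current_combination, blocking_rule, comparison_count, all_columns):
    row = {
        "blocking_rules": ", ".join(current_combination),
        "splink_blocking_rule": blocking_rule,   # the real SQL blocking rule
        "comparison_count": comparison_count,
        "complexity": len(current_combination),  # number of columns in the rule
    }
    for col in all_columns:
        # 1 if the column participates in this blocking rule, else 0
        row[f"__fixed__{col}"] = 1 if col in current_combination else 0
    return row
```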

Test script

from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker
from splink.find_brs_with_comparison_counts_below_threshold import (
    find_blocking_rules_below_threshold_comparison_count,
)

df = splink_datasets.fake_1000

settings = {
    "link_type": "dedupe_only",
}
linker = DuckDBLinker(df, settings)
linker._get_input_columns

find_blocking_rules_below_threshold_comparison_count(
    linker, max_comparisons_per_rule=10000
)

splink/linker.py Outdated
@@ -262,6 +262,29 @@ def _get_input_columns(

        return column_names

    @property
    def _column_names_as_input_columns(
Contributor:

This appears to be getting duplicated all over the place.

Ideally we'll have a single method that handles the extraction of our input columns - #1611, that can then be extended to clean, process, etc.

Contributor:

I'm happy to approve this PR with this function in it, if you're ok with cleaning it up once Andy's original PR goes in?

@RobinL (Member Author), Nov 20, 2023:

I've merged in master that includes Andy's work, and added arguments to his function so it's easy to filter out unique id columns and any specified additional_columns_to_retain

logger = logging.getLogger(__name__)


def sanitise_column_name(column_name):
Contributor:

This function may want to live inside misc.py.

It feels like something we'd want to use elsewhere.

Contributor:

Does this cause issues for users that have "spacey column names"?

i.e. my column is a valid column name in the rest of our code base, but would have the space stripped here.

@RobinL (Member Author), Nov 20, 2023:

I have now named things better in this commit

The sanitisation here is purely to help establish a correspondence between the one-hot encoded columns like __fixed__surname and the blocking rules in blocking_columns. I sanitise so I don't need to worry about the one-hot encoded columns ending up with special characters, spaces, etc.

This also means spaces don't matter, because the true blocking rule is held in splink_blocking_rule.
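
For example, a minimal version of this sanitisation might look like the following (a sketch only; the actual `sanitise_column_name` in this PR may differ):

```python
import re

def sanitise_column_name(column_name):
    # Keep only alphanumerics and underscores so the derived one-hot column
    # name (e.g. __fixed__postcode_fake) never contains spaces or special
    # characters.  "post code" -> "postcode"
    return re.sub(r"[^A-Za-z0-9_]", "", column_name)
```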

@RobinL (Member Author):

To illustrate where this is used:

[image: screenshot illustrating where this is used]

Comment on lines +107 to +108
├── c1                count_comparisons(c2, c1)
│   └── c3            count_comparisons(c2, c1, c3)
Contributor:

Is it worth making it clear in this diagram that these have already been visited?

They're stored in the hashset, which should quickly trim them from next_combinations.

Contributor:

(I may have misunderstood the order in which your loops work)

@RobinL (Member Author):

Done

Comment on lines 158 to 163
row = _generate_output_combinations_table_row(
    current_combination,
    br,
    comparison_count,
    all_columns,
)
Contributor:

Should we not move this down to line 182? We're constructing the hashmap here and only using it in the else section.

@RobinL (Member Author):

Good spot, done

@ThomasHepworth (Contributor) left a comment:

The core functionality is amazing here!!

Some very minor tweaks needed and then I'll tick this off.

"""Generate a Splink blocking rule given a list of column names which
are provided as strings"""

dialect = linker._sql_dialect
Contributor:

It might be worthwhile adding in a `# TODO: remove in Splink4` tag here.

Contributor:

There's a VSCode plugin that lets you see all outstanding TODO comments, which would ensure we don't forget about them too.

@ThomasHepworth (Contributor) left a comment:

Thanks Robin - again, a really interesting DP problem to read through.

RobinL and others added 2 commits November 20, 2023 10:48
Co-authored-by: Tom Hepworth <45356472+ThomasHepworth@users.noreply.github.com>
@RobinL RobinL merged commit 08bbe46 into master Nov 20, 2023
8 checks passed
@RobinL RobinL deleted the find_blocking_rules_that_create_comparisons_below_threshold branch November 20, 2023 11:08