
Finds blocking rules which return a comparison count below a given threshold #1665

Merged

Conversation

@RobinL (Member) commented Oct 24, 2023

This is the first PR in a series which will establish two new functions.

linker._detect_blocking_rules_for_prediction(
    max_comparisons_per_rule=1e3, min_freedom=1, 
)
linker._detect_blocking_rules_for_em_training(
    max_comparisons_per_rule=1e2, min_freedom=0
)

We have decided to leave these functions private (prefixed with _) for now so we can test them out before introducing them to the public API.

See here for further description.

This PR adds the first ingredient: the ability to quickly and efficiently find a list of blocking rules that returns a comparison count below a certain threshold.

Here's my current strategy.

This is motivated by exploiting the new linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions(br) function, which is super fast and so can evaluate large numbers of possible blocking rules quickly:

Let's say we have columns c1, c2, c3, etc.

We recurse through a tree running functions like:

c1                    count_comparisons(c1)           
├── c2                count_comparisons(c1, c2)
│   └── c3            count_comparisons(c1, c2, c3)     
├── c3                count_comparisons(c1, c3)
│   └── c2            count_comparisons(c1, c3, c2)  
c2                    count_comparisons(c2)          
├── c1                count_comparisons(c2, c1)       
│   └── c3            count_comparisons(c2, c1, c3)
├── c3                count_comparisons(c2, c3)
│   └── c1            count_comparisons(c2, c3, c1)
etc.

It skips running any count that has already been generated with the same columns in a different order.

Once the count is below the threshold, no branches from the node are explored.

So for example, if count_comparisons(c1,c2) < threshold_max_rows, it will skip all children such as count_comparisons(c1,c2,c3). In practice, this means only a small subset of the full tree is explored, making things a lot faster.
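
For illustration, here is a minimal sketch of that pruned search. This is not the code in this PR: `count_comparisons` stands in for `linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions`, and the traversal details and names are illustrative only.

```python
# Sketch only: a pruned depth-first search over column combinations.
def find_rules_below_threshold(columns, count_comparisons, threshold):
    results = []      # (columns_in_rule, comparison_count) pairs
    visited = set()   # frozensets of columns already counted

    def explore(current_combination):
        key = frozenset(current_combination)
        if key in visited:
            # Same columns in a different order: already counted, so skip.
            return
        visited.add(key)

        comparison_count = count_comparisons(current_combination)
        if comparison_count < threshold:
            # Below threshold: record the rule and prune the whole subtree,
            # since adding further columns only reduces the count.
            results.append((tuple(current_combination), comparison_count))
            return

        # Otherwise extend the rule with each column not yet used.
        for col in columns:
            if col not in current_combination:
                explore(current_combination + [col])

    for col in columns:
        explore([col])
    return results
```

Tracking visited combinations as frozensets is what makes re-ordered duplicates cheap to skip, and returning as soon as a combination falls below the threshold is what prunes the rest of that subtree.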

Moving to the real outputs now, with a threshold of 1m this results in a table like:

| blocking_rules | comparison_count | complexity |
| --- | --- | --- |
| first_name, surname | 533,375 | 2 |
| first_name, dob | 174,651 | 2 |
| first_name, birth_place | 344,971 | 2 |
| first_name, postcode_fake | 149,105 | 2 |
| first_name, gender, surname | 373,354 | 3 |
| etc. | | |
(Note the actual table has one-hot encoding, like this:)
| | blocking_rules | comparison_count | complexity | `__fixed__first_name` | `__fixed__surname` | `__fixed__dob` | `__fixed__birth_place` | `__fixed__postcode_fake` | `__fixed__gender` | `__fixed__occupation` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | first_name, surname | 533,375 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | first_name, dob | 174,651 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | first_name, birth_place | 344,971 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | first_name, postcode_fake | 149,105 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | first_name, gender, surname | 373,354 | 3 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
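
For illustration, each such row could be built along these lines. This is a hedged sketch rather than the PR's actual `_generate_output_combinations_table_row` helper (quoted later in this review), whose signature and body may differ.

```python
# Hypothetical sketch of building one output row with one-hot encoded columns.
def generate_output_row(current_combination, blocking_rule, comparison_count, all_columns):
    row = {
        "blocking_rules": ", ".join(current_combination),
        "splink_blocking_rule": blocking_rule,   # the real SQL blocking rule
        "comparison_count": comparison_count,
        "complexity": len(current_combination),  # number of columns in the rule
    }
    for col in all_columns:
        # 1 if the column participates in this blocking rule, else 0
        row[f"__fixed__{col}"] = 1 if col in current_combination else 0
    return row
```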

Test script

from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker
from splink.find_brs_with_comparison_counts_below_threshold import (
    find_blocking_rules_below_threshold_comparison_count,
)

df = splink_datasets.fake_1000

settings = {
    "link_type": "dedupe_only",
}
linker = DuckDBLinker(df, settings)
linker._get_input_columns

find_blocking_rules_below_threshold_comparison_count(
    linker, max_comparisons_per_rule=10000
)

splink/linker.py Outdated
@@ -262,6 +262,29 @@ def _get_input_columns(

        return column_names

    @property
    def _column_names_as_input_columns(
Contributor:

This appears to be getting duplicated all over the place.

Ideally we'll have a single method that handles the extraction of our input columns - #1611, that can then be extended to clean, process, etc.

Contributor:

I'm happy to approve this PR with this function in it, if you're ok with cleaning it up once Andy's original PR goes in?

@RobinL (Member Author), Nov 20, 2023:

I've merged in master that includes Andy's work, and added arguments to his function so it's easy to filter out unique id columns and any specified additional_columns_to_retain

logger = logging.getLogger(__name__)


def sanitise_column_name(column_name):
Contributor:

This function may want to live inside misc.py.

It feels like something we'd want to use elsewhere.

Contributor:

Does this cause issues for users that have "spacey column names"?

i.e. my column is a valid column name in the rest of our code base, but would have the space stripped here.

@RobinL (Member Author), Nov 20, 2023:

I have now named things better in this commit

The sanitisation here is purely to help establish a correspondence between the one-hot encoded columns like __fixed__surname and the blocking rules in blocking_columns. I sanitise so I don't need to worry about the one-hot encoded columns ending up with special characters, spaces, etc.

This also means spaces don't matter, because the true blocking rule is held in splink_blocking_rule.
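
For example, a minimal version of this sanitisation might look like the following (a sketch only; the actual `sanitise_column_name` in this PR may differ):

```python
import re

def sanitise_column_name(column_name):
    # Keep only alphanumerics and underscores so the derived one-hot column
    # name (e.g. __fixed__postcode_fake) never contains spaces or special
    # characters.  "post code" -> "postcode"
    return re.sub(r"[^A-Za-z0-9_]", "", column_name)
```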

@RobinL (Member Author):

To illustrate where this is used:

[image: screenshot illustrating where this is used]

Comment on lines +107 to +108
├── c1                count_comparisons(c2, c1)
│   └── c3            count_comparisons(c2, c1, c3)
Contributor:

Is it worth making it clear in this diagram that these have already been visited?

They're stored in the hashset, which should quickly trim them from next_combinations.

Contributor:

(I may have misunderstood the order in which your loops work)

@RobinL (Member Author):

Done

Comment on lines 158 to 163
row = _generate_output_combinations_table_row(
    current_combination,
    br,
    comparison_count,
    all_columns,
)
Contributor:

Should we not move this down to line 182? We're constructing the hashmap here and only using it in the else section.

@RobinL (Member Author):

Good spot, done

@ThomasHepworth (Contributor) left a comment:

The core functionality is amazing here!!

Some very minor tweaks needed and then I'll tick this off.

"""Generate a Splink blocking rule given a list of column names which
are provided as strings"""

dialect = linker._sql_dialect
Contributor:

It might be worthwhile adding in a `# TODO: remove in Splink4` tag here.

Contributor:

There's a VSCode plugin that lets you see all outstanding TODO comments, which would ensure we don't forget about them too.

@ThomasHepworth (Contributor) left a comment:

Thanks Robin - again, a really interesting DP problem to read through.

RobinL and others added 2 commits November 20, 2023 10:48
Co-authored-by: Tom Hepworth <45356472+ThomasHepworth@users.noreply.github.com>
@RobinL RobinL merged commit 08bbe46 into master Nov 20, 2023
8 checks passed
@RobinL RobinL deleted the find_blocking_rules_that_create_comparisons_below_threshold branch November 20, 2023 11:08