-
Notifications
You must be signed in to change notification settings - Fork 130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Finds blocking rules which return a comparison count below a given threshold #1665
Finds blocking rules which return a comparison count below a given threshold #1665
Conversation
splink/linker.py
Outdated
@@ -262,6 +262,29 @@ def _get_input_columns( | |||
|
|||
return column_names | |||
|
|||
@property | |||
def _column_names_as_input_columns( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be getting duplicated all over the place.
Ideally we'll have a single method that handles the extraction of our input columns - #1611, that can then be extended to clean, process, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to approve this PR with this function in it, if you're ok with cleaning it up once Andy's original PR goes in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've merged in master
that includes Andy's work, and added arguments to his function so it's easy to filter out unique id columns and any specified additional_columns_to_retain
logger = logging.getLogger(__name__) | ||
|
||
|
||
def sanitise_column_name(column_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function may want to live inside misc.py.
It feels like something we'd want to use elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this cause issues for users that have "spacey column names"?
i.e. my column
is a valid column name in the rest of our code base, but would have the space stripped here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now named things better in this commit
The sanitisation here is purely to help establish a correspondence between one hot encoded columns __fixed__surname
and the blocking rules in blocking_columns
. I sanitise so I don't need to worry that the one hot encoded columns end up with special characters/spaces etc.
This also means spaces don't matter because the true blocking rule is held in splink_blocking_rule
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
├── c1 count_comparisons(c2, c1) | ||
│ └── c3 count_comparisons(c2, c1, c3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth making it clear in this diagram that these have already been visited?
They're stored in the hashset, which should quickly trim them from next_combations
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I may have misunderstood the order in which your loops work)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
row = _generate_output_combinations_table_row( | ||
current_combination, | ||
br, | ||
comparison_count, | ||
all_columns, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we not move this down to line 182? We're constructing the hashmap here and only using it in the else section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot, done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The core functionality is amazing here!!
Some very minor tweaks needed and then I'll tick this off.
…s_below_threshold
…s_below_threshold
"""Generate a Splink blocking rule given a list of column names which | ||
are provided as as string""" | ||
|
||
dialect = linker._sql_dialect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worthwhile adding in a # TODO: remove in Splink4
tag here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a VSCode plugin that allows you to see all outstanding TODO
comments that would ensure we don't forget about them too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Robin - again, a really interesting DP problem to read through.
Co-authored-by: Tom Hepworth <45356472+ThomasHepworth@users.noreply.github.com>
This is the first PR in a series which will establish two new functions.
We have decided to leave these functions private (
_
) for now so we can test them out before introducing them to the public APISee here for further description.
This PR adds the first ingredient: the ability to quickly and efficiently find a list of blocking rules that returns a comparison count below a certain threshold.
Here's my current strategy.
This is motivated by exploiting the new
linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions(br)
function that is super fast, so can evaluate large numbers of possible blocking rules quickly:Let's say we have columns
c1
,c2
,c3
, etc.We recurse through a tree running functions like:
It skips running any count that has been generated before with the same columns in a different order
Once the count is below the threshold, no branches from the node are explored.
So for example, if
count_comparisons(c1,c2) < threshold_max_rows
, it will skip all children such ascount_comparisons(c1,c2,c3)
. In practice, this means only a small subset of the full tree is explored, making things a lot faster.Moving to the real outputs now with a threshold of 1m, this results in a table of results like:
(Note the actual table has one hot encoding like this)
Test script