Automatically detect blocking rules for prediction and blocking rules for EM training #1668
Conversation
The code currently breaks when … At present, we convert any string BRs -> BlockingRules in the … I would either: …
Demo of break:

```python
from splink.duckdb.duckdb_linker import DuckDBLinker
import pandas as pd

from tests.basic_settings import get_settings_dict

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

linker = DuckDBLinker(
    df,
    get_settings_dict(),
)

linker._detect_blocking_rules_for_prediction(1e6, min_freedom=2)
linker.predict()
```
I'll double check this in the morning when I'm fresh, but this looks fantastic.
I think this looks great! Most of the complicated logic seems to be from the previous two PRs, so I think I'm basically happy to tick this off once my comments have been addressed.
Co-authored-by: Tom Hepworth <45356472+ThomasHepworth@users.noreply.github.com>
Thanks Robin. If you get the tests to pass, then I'm happy to approve this.
great spot, fixed here
Thanks Robin, happy for you to merge this into master now.
I know you've already mentioned this to the team, but it's probably worth bringing it up again and seeing if someone can crash test this on a live pipeline.
This is the third PR in a series which will establish two new functions.
We have decided to leave these functions private (prefixed with `_`) for now so we can test them out before introducing them to the public API.
Links to other PRs:
In this third PR, I use the functions in the previous two to automatically detect good blocking rules.
The input (from the second PR) is a table like this:
(Note the actual table has one hot encoding like this)
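The table itself isn't rendered here, but to illustrate the rough shape (the column names, rules, and counts below are hypothetical, not splink's actual output), a one-hot coverage table might look something like this: one row per candidate blocking rule, with a 1 where the rule holds a column fixed and a 0 where the column is free to vary.

```python
import pandas as pd

# Hypothetical sketch of the one-hot table shape described above.
# 1 = the blocking rule requires equality on that column;
# 0 = the column is free to vary under that rule.
coverage = pd.DataFrame(
    {
        "blocking_rule": [
            'l."first_name" = r."first_name"',
            'l."surname" = r."surname"',
            'l."first_name" = r."first_name" AND l."dob" = r."dob"',
        ],
        "first_name": [1, 0, 1],
        "surname": [0, 1, 0],
        "dob": [0, 0, 1],
        "comparison_count": [5120, 6400, 870],  # made-up counts for illustration
    }
)
print(coverage)
```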
Once you have this table, we want an algorithm to find combinations of rows that satisfy some criterion.
Specifically, we're looking for an algorithm that satisfies the following constraint:
Amongst solutions that satisfy that constraint, the algorithm should minimise the following cost function:
At the moment, the implementation is heuristic: it picks random-ish rows from a list sorted by complexity until the main 'must vary in at least n' constraint is met, runs the cost function, then repeats n times and chooses the solution with the lowest cost.
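As a rough illustration (not the actual splink implementation), the heuristic above can be sketched like this, assuming each candidate rule is represented as a `(complexity, coverage, cost)` tuple where `coverage` is a one-hot tuple with 1 meaning "column held fixed" and 0 meaning "column free to vary":

```python
import random


def pick_rules(rows, min_freedom, n_restarts=100, seed=0):
    """Randomised greedy search for a low-cost set of blocking rules.

    Constraint: every column must be free to vary (coverage bit == 0)
    in at least `min_freedom` of the chosen rules.  Among restarts that
    satisfy the constraint, keep the choice with the lowest total cost.
    """
    rng = random.Random(seed)
    n_cols = len(rows[0][1])

    best_choice, best_cost = None, float("inf")
    for _ in range(n_restarts):
        # Sort by complexity, with a random tiebreak so each restart
        # walks the candidates in a slightly different "random-ish" order.
        candidates = sorted(rows, key=lambda r: (r[0], rng.random()))
        chosen, freedom = [], [0] * n_cols
        for complexity, coverage, cost in candidates:
            # Stop adding rules once every column varies often enough.
            if all(f >= min_freedom for f in freedom):
                break
            chosen.append((complexity, coverage, cost))
            for j, bit in enumerate(coverage):
                if bit == 0:
                    freedom[j] += 1
        if all(f >= min_freedom for f in freedom):
            total = sum(c for _, _, c in chosen)  # the cost function
            if total < best_cost:
                best_choice, best_cost = chosen, total
    return best_choice, best_cost
```

Because each restart is cheap, running many restarts and keeping the best is fast, which is the argument above for not reaching for a SAT solver.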
It's fast to do this a large number of times, so I don't think there's any need to use a heavyweight tool like a SAT solver.
Test script:
Output:
Spark test script: