(WIP) Auto blocking rules #1464
Conversation
Test: test_2_rounds_1k_duckdb (percentage change: -10.0%)
Test: test_2_rounds_1k_sqlite (percentage change: 0.8%)
@RobinL have you seen how https://www.github.com/dedupeio/dedupe does this?
No! - I probably should have looked! Do you know off the top of your head?
I'll tag @fgregg here, in case they want to be generous enough to weigh in on some of the lessons they learned writing dedupe. Any thoughts or tips would be appreciated!

It's quite interesting. It requires labeled pairs, so the algorithm knows which pairs are true matches, but perhaps we could do some iterative process: run EM to get some guesses as to matches, learn some blocking rules, then repeat. Or at least there is some inspiration there that we could pull out.

First, read the high-level description. If you look more at dedupe, it can be a bit confusing because there are two layers to the block learner. The core one, in training.py, is used during the final training. But dedupe also has an active learning phase, where it has to decide which pairs to present to a human for labelling. That phase has its own block learner in labeler.py, which wraps the underlying one. For our purposes, ignore labeler.py and just look at the core one in training.py.

Now that I'm looking again I don't see very good docstrings, so this is mostly coming from memory from when I looked at this hard in the past, and I might not have this exactly right, but as I remember it the basic algorithm is:
There are many great issues where the authors talk about their experiments and thoughts; we should add them here if we find them.
Thanks - that is super useful, really appreciate you taking the time. I will read properly when I find time to continue work on this PR |
@NickCrews if you're interested I've updated the PR description (first message above) with the current strategy I'm using. I'm sure there's room for improvement but in practice this seems to be giving pretty good results already |
Edit: Closing in favour of splitting this into the following three PRs:
#1665
#1667
#1668
The objective of this PR is to:
Here's my current strategy.
This is motivated by exploiting the new `linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions(br)` function, which is super fast, so it can evaluate large numbers of possible blocking rules quickly.

Let's say we have columns `c1`, `c2`, `c3`, etc. We recurse through a tree running functions like:
It skips running any count that has already been computed with the same columns in a different order.
Once the count is below the threshold, no branches from the node are explored.
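The memoisation and pruning described above can be sketched roughly like this. Note that `count_comparisons` here is a self-contained mock standing in for `linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions`; the real function queries the underlying data.

```python
# Sketch of the pruned tree search over column combinations.
# count_comparisons is MOCKED so the example runs standalone:
# it just pretends each extra column cuts comparisons by 10x.
def count_comparisons(cols):
    return 10_000_000 // (10 ** len(cols))

def search(columns, threshold_max_rows):
    results = {}  # frozenset of columns -> comparison count

    def recurse(current):
        key = frozenset(current)
        if key in results:
            # Skip counts already computed with the same
            # columns in a different order
            return
        n = count_comparisons(current)
        results[key] = n
        if n < threshold_max_rows:
            # Prune: once the count is below the threshold,
            # no children of this node are explored
            return
        for c in columns:
            if c not in current:
                recurse(current + (c,))

    for c in columns:
        recurse((c,))
    return results

found = search(("c1", "c2", "c3"), threshold_max_rows=1_000_000)
```

With this mock, the two-column counts fall below the threshold, so no three-column rule is ever evaluated: the full tree is never explored.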
So for example, if `count_comparisons(c1, c2) < threshold_max_rows`, it will skip all children such as `count_comparisons(c1, c2, c3)`. In practice, this means only a small subset of the full tree is explored, making things a lot faster.

Moving to the real outputs now, with a threshold of 1m, this results in a table of results like:
(Note the actual table has one-hot encoding like this)
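The actual results table isn't reproduced in this text, so here is a purely illustrative sketch of the one-hot layout: one row per candidate blocking rule, a 0/1 column per source column indicating whether it appears in the rule, plus the comparison count. All the numbers below are made up.

```python
import pandas as pd

# Hypothetical one-hot encoded results table; values are invented
# for illustration only.
results = pd.DataFrame(
    [
        {"blocking_rule": "c1",     "c1": 1, "c2": 0, "c3": 0, "comparison_count": 1_000_000},
        {"blocking_rule": "c1, c2", "c1": 1, "c2": 1, "c3": 0, "comparison_count": 120_000},
        {"blocking_rule": "c2, c3", "c1": 0, "c2": 1, "c3": 1, "comparison_count": 85_000},
    ]
)
```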
Once you have this table, we want an algorithm to find combinations of rows that satisfy some criterion.
Specifically, we're looking for an algorithm that must satisfy the following constraint:
Amongst solutions that satisfy that constraint, the algorithm should minimise the following cost function:
At the moment, the implementation of this is heuristic: it picks some random-ish rows from a list sorted by complexity until the constraint is met, evaluates the cost function, then repeats n times, keeping the best solution found.
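That repeated random-selection heuristic could look something like the sketch below. The PR's actual constraint and cost function aren't shown in this text, so `satisfies_constraint` and `cost` here are placeholders supplied by the caller, and the "random-ish, biased towards simpler rows" selection is one possible interpretation.

```python
import random

def random_solution(rows, satisfies_constraint, rng):
    # rows are assumed pre-sorted by complexity, simplest first
    chosen = []
    candidates = rows.copy()
    while candidates and not satisfies_constraint(chosen):
        # "Random-ish" pick: taking the min of two random indices
        # biases the choice towards earlier (simpler) rows
        idx = min(rng.randrange(len(candidates)),
                  rng.randrange(len(candidates)))
        chosen.append(candidates.pop(idx))
    return chosen

def best_of_n(rows, satisfies_constraint, cost, n=100, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(n):
        sol = random_solution(rows, satisfies_constraint, rng)
        if satisfies_constraint(sol):
            c = cost(sol)
            if c < best_cost:
                best, best_cost = sol, c
    return best
```

For example, with toy integer "rows", a constraint that the selection sums to at least 10, and solution length as the cost, `best_of_n([1, 2, 3, 4, 5], lambda s: sum(s) >= 10, cost=len)` returns a small subset whose sum meets the threshold.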
Todo:
Here is a testing script: