
(WIP) Auto blocking rules #1464

Status: Closed · wants to merge 14 commits
Conversation

@RobinL (Member) commented Jul 24, 2023

Edit: Closing in favour of splitting this into the following three PRs:

#1665
#1667
#1668

The objective of this PR is to:

  • Find a list of blocking rules that produce a count of pairwise comparisons below a user-specified threshold
  • Find a list of blocking rules suitable for EM training that allows each field to vary at least n times
  • Find a list of blocking rules suitable for prediction that allows each field to vary at least n times

Here's my current strategy.

It is motivated by the new `linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions(br)` function, which is fast enough that we can evaluate a large number of candidate blocking rules quickly:

Let's say we have columns c1, c2, c3, etc.

We recurse through a tree running functions like:

```
c1                    count_comparisons(c1)
├── c2                count_comparisons(c1, c2)
│   └── c3            count_comparisons(c1, c2, c3)
├── c3                count_comparisons(c1, c3)
│   └── c2            count_comparisons(c1, c3, c2)
c2                    count_comparisons(c2)
├── c1                count_comparisons(c2, c1)
│   └── c3            count_comparisons(c2, c1, c3)
├── c3                count_comparisons(c2, c3)
│   └── c1            count_comparisons(c2, c3, c1)
etc.
```

It skips any count that has already been computed with the same columns in a different order.

Once a count falls below the threshold, no further branches from that node are explored.

So for example, if count_comparisons(c1,c2) < threshold_max_rows, it will skip all children such as count_comparisons(c1,c2,c3). In practice, this means only a small subset of the full tree is explored, making things a lot faster.
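
A minimal sketch of that pruned, de-duplicated search (illustrative only; `count_comparisons` stands in for the fast pre-filter count, and none of these names come from the actual implementation):

```python
def find_rules_below_threshold(columns, count_comparisons, threshold):
    """Depth-first search over column combinations.

    `count_comparisons` is a stand-in for the fast pre-filter count; it takes a
    tuple of column names and returns the estimated number of pairwise comparisons.
    """
    results = {}  # frozenset of columns -> comparison count
    seen = set()  # combinations already counted, order-insensitive

    def recurse(fixed_cols):
        key = frozenset(fixed_cols)
        if key in seen:
            return  # same columns already counted in a different order
        seen.add(key)

        n = count_comparisons(fixed_cols)
        results[key] = n
        if n < threshold:
            return  # below threshold: don't explore this node's children

        for col in columns:
            if col not in fixed_cols:
                recurse(fixed_cols + (col,))

    for col in columns:
        recurse((col,))
    return results
```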

Moving to the real outputs now, with a threshold of 1m this results in a table like:

| blocking_rules | comparison_count | complexity |
| --- | --- | --- |
| first_name, surname | 533,375 | 2 |
| first_name, dob | 174,651 | 2 |
| first_name, birth_place | 344,971 | 2 |
| first_name, postcode_fake | 149,105 | 2 |
| first_name, gender, surname | 373,354 | 3 |
| etc. | | |
(Note: the actual table also carries a one-hot encoding of which fields each rule fixes, like this:)
| | blocking_rules | comparison_count | complexity | `__fixed__first_name` | `__fixed__surname` | `__fixed__dob` | `__fixed__birth_place` | `__fixed__postcode_fake` | `__fixed__gender` | `__fixed__occupation` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | first_name, surname | 533,375 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | first_name, dob | 174,651 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | first_name, birth_place | 344,971 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | first_name, postcode_fake | 149,105 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | first_name, gender, surname | 373,354 | 3 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
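
For illustration, one way such `__fixed__*` indicator columns could be derived, assuming `blocking_rules` holds the list of columns each rule fixes (a sketch, not the PR's actual code):

```python
import pandas as pd

# Hypothetical example rows: each blocking rule is a list of fixed columns
df = pd.DataFrame(
    {
        "blocking_rules": [["first_name", "surname"], ["first_name", "dob"]],
        "comparison_count": [533_375, 174_651],
    }
)
df["complexity"] = df["blocking_rules"].apply(len)

# One indicator column per field that any rule holds fixed
all_fields = sorted({f for rule in df["blocking_rules"] for f in rule})
for field in all_fields:
    df[f"__fixed__{field}"] = df["blocking_rules"].apply(
        lambda rule: int(field in rule)
    )
```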

Once you have this table, we want an algorithm to find combinations of rows that satisfy some criterion.

Specifically, we're looking for an algorithm that satisfies the following constraint:

  • Between the rows selected, all columns should be allowed to vary by at least n of the blocking rules.

Amongst solutions that satisfy that constraint, the algorithm should minimise the following cost function:

  • Have a strong preference for selecting 'lower-complexity' rows (fewer variables being blocked on).
  • Have a strong preference for allowing each field to vary as many times as possible (i.e. fixing each individual field as few times as possible, e.g. if possible, don't hold the same field fixed more than once).

At the moment, the implementation of this is heuristic: it picks somewhat-random rows from a list sorted by complexity until the constraint is met, evaluates the cost function on that selection, and repeats n times, keeping the best selection found (see the sketch below).
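
Roughly, something like the following sketch (the function name, the `min_freedom` parameter, the quadratic penalty and the index-biasing trick are illustrative assumptions, not the exact implementation):

```python
import random

def pick_rules(df, min_freedom, n_attempts=100, seed=0):
    """Randomised greedy selection: repeatedly pick rows from the
    complexity-sorted table until every field is free to vary in at least
    `min_freedom` of the selected rules, then keep the cheapest selection
    found across `n_attempts` restarts.
    """
    rng = random.Random(seed)
    fixed_cols = [c for c in df.columns if c.startswith("__fixed__")]
    rows = df.sort_values("complexity").to_dict("records")

    def constraint_met(selected):
        # a field "varies" in a rule if that rule does not fix it
        return all(
            sum(1 - r[c] for r in selected) >= min_freedom for c in fixed_cols
        )

    def cost(selected):
        # prefer low complexity, and fixing each field as few times as possible
        return sum(r["complexity"] for r in selected) + sum(
            sum(r[c] for r in selected) ** 2 for c in fixed_cols
        )

    best, best_cost = None, float("inf")
    for _ in range(n_attempts):
        selected, remaining = [], rows.copy()
        while remaining and not constraint_met(selected):
            # random-ish: bias towards the low-complexity end of the sorted list
            idx = min(rng.randrange(len(remaining)), rng.randrange(len(remaining)))
            selected.append(remaining.pop(idx))
        if constraint_met(selected) and cost(selected) < best_cost:
            best, best_cost = selected, cost(selected)
    return best
```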

Todo:

  • Need a check that skips any combination whose columns are a superset of a combination that has already achieved the threshold. Such combinations can be reached again when the same columns are encountered in a different order, so the branch pruning alone doesn't catch them (a sketch of the check is below).
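
A sketch of that check, assuming combinations are represented as frozensets of column names:

```python
def should_skip(candidate_cols, satisfied_combinations):
    """True if the candidate's columns are a superset of a combination that has
    already come in below the threshold, so counting it adds nothing new."""
    candidate = frozenset(candidate_cols)
    return any(candidate >= combo for combo in satisfied_combinations)

# e.g. if frozenset({"first_name", "dob"}) already met the threshold,
# should_skip(("dob", "first_name", "surname"), satisfied) -> True
```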

Here is a testing script:

```python
import pandas as pd
from IPython.display import display

from splink.autoblocking import (
    blocking_rules_for_prediction_report,
    suggest_blocking_rules_for_prediction,
)
from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker

pd.options.display.float_format = "{:,.0f}".format

df = splink_datasets.historical_50k
# Alternative, larger test dataset:
# df = pd.read_parquet(
#     "/Users/robinlinacre/Documents/data_linking/splink/synthetic_1m_clean.parquet"
# )

settings = {
    "link_type": "dedupe_only",
    "additional_columns_to_retain": ["cluster", "full_name", "first_and_surname"],
}
linker = DuckDBLinker(df, settings)

# Find all blocking rules whose estimated comparison count is below the threshold
df_result_2 = linker._find_blocking_rules_below_threshold(1e6)
df_result_2["comparison_count"] = df_result_2["comparison_count"].astype(float)
display(df_result_2)

# Choose a set of blocking rules for prediction from those candidates
blocking_rule_suggestions = suggest_blocking_rules_for_prediction(df_result_2, 2, 5)
print(blocking_rules_for_prediction_report(blocking_rule_suggestions))


# The same workflow via a single linker method
df = splink_datasets.historical_50k

settings = {
    "link_type": "dedupe_only",
    "additional_columns_to_retain": ["cluster", "full_name", "first_and_surname"],
}
linker = DuckDBLinker(df, settings)
linker._detect_blocking_rules_for_prediction(1e6, min_freedom=2)
```


@RobinL marked this pull request as a draft on July 24, 2023 at 19:49
@github-actions (Contributor) commented:

Test: test_2_rounds_1k_duckdb

Percentage change: -10.0%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 849 | 2022-07-12 | 18:40:05 | 1.89098 | 1.87463 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1884 | 2023-07-24 | 19:50:37 | 1.70874 | 1.68708 | (detached head) | c1116ff | Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | 2.0951 GHz | c1116ff |

Test: test_2_rounds_1k_sqlite

Percentage change: 0.8%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 851 | 2022-07-12 | 18:40:05 | 4.32179 | 4.25898 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1886 | 2023-07-24 | 19:50:37 | 4.30223 | 4.2951 | (detached head) | c1116ff | Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | 2.0951 GHz | c1116ff |

(Vega-Lite time series charts omitted.)

@NickCrews (Contributor) commented Jul 27, 2023

@RobinL have you seen how https://www.github.com/dedupeio/dedupe does this?

@RobinL (Member, Author) commented Jul 27, 2023

No! - I probably should have looked! Do you know off the top of your head?

@NickCrews (Contributor) commented:

I'll tag @fgregg, in case they want to be generous enough to weigh in on some of the lessons they learned writing dedupe. Any thoughts or tips would be appreciated!

It's quite interesting. It requires labeled pairs, so the algorithm knows which pairs are true matches, but perhaps we could do an iterative process: run EM to get some guesses at matches, then learn some blocking rules, then repeat. Or at least there is some inspiration there that we could pull out.

First, read the high level description.

If you look more at dedupe, it can be a bit confusing because there are two layers to the block learner. The core one, in training.py, is used during the final training. But there is also an active learning phase in dedupe, where it has to decide which pairs to present to a human to label; that phase has its own block learner in labeler.py, which wraps the underlying one. For our purposes, ignore labeler.py and just look at the core one in training.py.

Now that I'm looking again, I don't see very good docstrings, so this is mostly from memory from when I looked at this hard in the past, and I might not have it exactly right. But as I remember it, the basic algorithm is:

  • Pass a set of candidate atomic blocking rules (e.g. l.name = r.name), the total dataset, and a third param that I don't quite understand into DedupeBlockLearner.__init__().
  • In DedupeBlockLearner.learn(), pass a collection of true-match pairs (from the manual labeling process) and a recall fraction between 0 and 1. The other two params are just configuration: index_predicates is a flag for whether to use a reverse-index data structure to look up records from a blocking key, e.g. "find all records that contain the substring 'abc'" (this is expensive, so they feature-flag it). The candidate_types flag controls how compound blocking rules are constructed from the atomic rules you passed in. I think it uses the "simple" (fast) mode during the labeling phase, when things need to be fast, and "random forest" during the final training phase, when it can be slower but more accurate. It looks like in this PR you are sort of headed in the random-forest direction.
  • Then the learner generates compound candidate rules from the atomic rules using one of the above strategies.
  • For each rule, it looks at (1) which true matches it covers and (2) which false matches it covers.
  • So now we have an optimization problem: select a subset of blocking rules that covers at least a recall fraction of the true matches while including the smallest number of non-matches. They claim to use Chvátal's greedy set-cover algorithm, but I didn't actually look at their implementation (see the sketch after this list).
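
For reference, a minimal sketch of a greedy weighted set-cover selection along those lines (not dedupe's actual code; the data structures and the simple cost ratio here are assumptions): repeatedly pick the rule with the best ratio of newly covered true matches to non-matches included, until the recall target is reached.

```python
def greedy_cover(rules, true_matches, recall):
    """rules: dict mapping rule name -> (covered_matches, covered_nonmatches),
    both sets of pair ids. Select rules until `recall` of the true matches is
    covered, greedily preferring rules that add many matches per non-match.
    """
    target = recall * len(true_matches)
    covered, selected = set(), []
    remaining = dict(rules)

    while len(covered) < target and remaining:
        def score(item):
            name, (matches, nonmatches) = item
            new = len(matches - covered)
            return new / (1 + len(nonmatches))  # matches gained per non-match cost

        best_name, (matches, _) = max(remaining.items(), key=score)
        if not (matches - covered):
            break  # no remaining rule adds new coverage
        selected.append(best_name)
        covered |= matches
        del remaining[best_name]

    return selected
```

Chvátal's algorithm is essentially this pattern, with the cost of a rule being the non-matching pairs it introduces.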

There are many great issues where the authors talk about their experiments and thoughts; we should add links here if we find them.

@RobinL (Member, Author) commented Jul 27, 2023

Thanks - that is super useful; I really appreciate you taking the time. I will read it properly when I find time to continue work on this PR.

@RobinL (Member, Author) commented Jul 28, 2023

@NickCrews, if you're interested, I've updated the PR description (first message above) with the current strategy I'm using. I'm sure there's room for improvement, but in practice this seems to be giving pretty good results already.
