
(WIP) Auto blocking rules #1464

Status: Closed · wants to merge 14 commits
Conversation

@RobinL (Member) commented Jul 24, 2023

Edit: Closing in favour of splitting this into the following three PRs:

#1665
#1667
#1668

The objective of this PR is to:

  • Find a list of blocking rules that produce a count of pairwise comparisons below a user-specified threshold
  • Find a list of blocking rules suitable for EM training that allows each field to vary at least n times
  • Find a list of blocking rules suitable for prediction that allows each field to vary at least n times

Here's my current strategy.

It is motivated by the new `linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions(br)` function, which is fast enough that we can evaluate a large number of candidate blocking rules quickly:

Let's say we have columns c1, c2, c3, etc.

We recurse through a tree running functions like:

```
c1                    count_comparisons(c1)
├── c2                count_comparisons(c1, c2)
│   └── c3            count_comparisons(c1, c2, c3)
├── c3                count_comparisons(c1, c3)
│   └── c2            count_comparisons(c1, c3, c2)
c2                    count_comparisons(c2)
├── c1                count_comparisons(c2, c1)
│   └── c3            count_comparisons(c2, c1, c3)
├── c3                count_comparisons(c2, c3)
│   └── c1            count_comparisons(c2, c3, c1)
etc.
```

It skips any count that has already been computed with the same columns in a different order.

Once a count falls below the threshold, no further branches from that node are explored.

So for example, if count_comparisons(c1,c2) < threshold_max_rows, it will skip all children such as count_comparisons(c1,c2,c3). In practice, this means only a small subset of the full tree is explored, making things a lot faster.
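
A minimal sketch of that pruned, de-duplicated search (illustrative only; `count_comparisons` stands in for the fast pre-filter count, and none of these names come from the actual implementation):

```python
def find_rules_below_threshold(columns, count_comparisons, threshold):
    """Depth-first search over column combinations.

    `count_comparisons` is a stand-in for the fast pre-filter count; it takes a
    tuple of column names and returns the estimated number of pairwise comparisons.
    """
    results = {}  # frozenset of columns -> comparison count
    seen = set()  # combinations already counted, order-insensitive

    def recurse(fixed_cols):
        key = frozenset(fixed_cols)
        if key in seen:
            return  # same columns already counted in a different order
        seen.add(key)

        n = count_comparisons(fixed_cols)
        results[key] = n
        if n < threshold:
            return  # below threshold: don't explore this node's children

        for col in columns:
            if col not in fixed_cols:
                recurse(fixed_cols + (col,))

    for col in columns:
        recurse((col,))
    return results
```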

Moving to the real outputs now, with a threshold of 1m this results in a table like:

| blocking_rules | comparison_count | complexity |
| --- | --- | --- |
| first_name, surname | 533,375 | 2 |
| first_name, dob | 174,651 | 2 |
| first_name, birth_place | 344,971 | 2 |
| first_name, postcode_fake | 149,105 | 2 |
| first_name, gender, surname | 373,354 | 3 |
| etc. | | |
(Note: the actual table also carries a one-hot encoding of which fields each rule fixes, like this:)
| | blocking_rules | comparison_count | complexity | `__fixed__first_name` | `__fixed__surname` | `__fixed__dob` | `__fixed__birth_place` | `__fixed__postcode_fake` | `__fixed__gender` | `__fixed__occupation` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | first_name, surname | 533,375 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | first_name, dob | 174,651 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | first_name, birth_place | 344,971 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | first_name, postcode_fake | 149,105 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | first_name, gender, surname | 373,354 | 3 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
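
For illustration, one way such `__fixed__*` indicator columns could be derived, assuming `blocking_rules` holds the list of columns each rule fixes (a sketch, not the PR's actual code):

```python
import pandas as pd

# Hypothetical example rows: each blocking rule is a list of fixed columns
df = pd.DataFrame(
    {
        "blocking_rules": [["first_name", "surname"], ["first_name", "dob"]],
        "comparison_count": [533_375, 174_651],
    }
)
df["complexity"] = df["blocking_rules"].apply(len)

# One indicator column per field that any rule holds fixed
all_fields = sorted({f for rule in df["blocking_rules"] for f in rule})
for field in all_fields:
    df[f"__fixed__{field}"] = df["blocking_rules"].apply(
        lambda rule: int(field in rule)
    )
```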

Once you have this table, we want an algorithm to find combinations of rows that satisfy some criterion.

Specifically, we're looking for an algorithm that satisfies the following constraint:

  • Between the rows selected, all columns should be allowed to vary by at least n of the blocking rules.

Amongst solutions that satisfy that constraint, the algorithm should minimise the following cost function:

  • Have a strong preference for selecting 'lower-complexity' rows (fewer variables being blocked on).
  • Have a strong preference for allowing each field to vary as many times as possible (i.e. fixing each individual field as few times as possible, e.g. if possible, don't hold the same field fixed more than once).

At the moment, the implementation of this is heuristic: it picks somewhat-random rows from a list sorted by complexity until the constraint is met, evaluates the cost function on that selection, and repeats n times, keeping the best selection found (see the sketch below).
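
Roughly, something like the following sketch (the function name, the `min_freedom` parameter, the quadratic penalty and the index-biasing trick are illustrative assumptions, not the exact implementation):

```python
import random

def pick_rules(df, min_freedom, n_attempts=100, seed=0):
    """Randomised greedy selection: repeatedly pick rows from the
    complexity-sorted table until every field is free to vary in at least
    `min_freedom` of the selected rules, then keep the cheapest selection
    found across `n_attempts` restarts.
    """
    rng = random.Random(seed)
    fixed_cols = [c for c in df.columns if c.startswith("__fixed__")]
    rows = df.sort_values("complexity").to_dict("records")

    def constraint_met(selected):
        # a field "varies" in a rule if that rule does not fix it
        return all(
            sum(1 - r[c] for r in selected) >= min_freedom for c in fixed_cols
        )

    def cost(selected):
        # prefer low complexity, and fixing each field as few times as possible
        return sum(r["complexity"] for r in selected) + sum(
            sum(r[c] for r in selected) ** 2 for c in fixed_cols
        )

    best, best_cost = None, float("inf")
    for _ in range(n_attempts):
        selected, remaining = [], rows.copy()
        while remaining and not constraint_met(selected):
            # random-ish: bias towards the low-complexity end of the sorted list
            idx = min(rng.randrange(len(remaining)), rng.randrange(len(remaining)))
            selected.append(remaining.pop(idx))
        if constraint_met(selected) and cost(selected) < best_cost:
            best, best_cost = selected, cost(selected)
    return best
```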

Todo:

  • Need a check that skips any combination whose columns are a superset of a combination that has already achieved the threshold. Such combinations can be reached again when the same columns are encountered in a different order, so the branch pruning alone doesn't catch them (a sketch of the check is below).
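
A sketch of that check, assuming combinations are represented as frozensets of column names:

```python
def should_skip(candidate_cols, satisfied_combinations):
    """True if the candidate's columns are a superset of a combination that has
    already come in below the threshold, so counting it adds nothing new."""
    candidate = frozenset(candidate_cols)
    return any(candidate >= combo for combo in satisfied_combinations)

# e.g. if frozenset({"first_name", "dob"}) already met the threshold,
# should_skip(("dob", "first_name", "surname"), satisfied) -> True
```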

Here is a testing script:

```python
import pandas as pd
from IPython.display import display

from splink.autoblocking import (
    blocking_rules_for_prediction_report,
    suggest_blocking_rules_for_prediction,
)
from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker

pd.options.display.float_format = "{:,.0f}".format

df = splink_datasets.historical_50k
# Alternative, larger test dataset:
# df = pd.read_parquet(
#     "/Users/robinlinacre/Documents/data_linking/splink/synthetic_1m_clean.parquet"
# )

settings = {
    "link_type": "dedupe_only",
    "additional_columns_to_retain": ["cluster", "full_name", "first_and_surname"],
}
linker = DuckDBLinker(df, settings)

# Find all blocking rules whose estimated comparison count is below the threshold
df_result_2 = linker._find_blocking_rules_below_threshold(1e6)
df_result_2["comparison_count"] = df_result_2["comparison_count"].astype(float)
display(df_result_2)

# Choose a set of blocking rules for prediction from those candidates
blocking_rule_suggestions = suggest_blocking_rules_for_prediction(df_result_2, 2, 5)
print(blocking_rules_for_prediction_report(blocking_rule_suggestions))


# The same workflow via a single linker method
df = splink_datasets.historical_50k

settings = {
    "link_type": "dedupe_only",
    "additional_columns_to_retain": ["cluster", "full_name", "first_and_surname"],
}
linker = DuckDBLinker(df, settings)
linker._detect_blocking_rules_for_prediction(1e6, min_freedom=2)
```


@RobinL marked this pull request as a draft on July 24, 2023 at 19:49
@github-actions (Contributor) commented:

Test: test_2_rounds_1k_duckdb

Percentage change: -10.0%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 849 | 2022-07-12 | 18:40:05 | 1.89098 | 1.87463 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1884 | 2023-07-24 | 19:50:37 | 1.70874 | 1.68708 | (detached head) | c1116ff | Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | 2.0951 GHz | c1116ff |

Test: test_2_rounds_1k_sqlite

Percentage change: 0.8%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 851 | 2022-07-12 | 18:40:05 | 4.32179 | 4.25898 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1886 | 2023-07-24 | 19:50:37 | 4.30223 | 4.2951 | (detached head) | c1116ff | Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | 2.0951 GHz | c1116ff |

(Vega-Lite time series charts omitted.)

@NickCrews (Contributor) commented Jul 27, 2023

@RobinL have you seen how https://www.github.com/dedupeio/dedupe does this?

@RobinL (Member, Author) commented Jul 27, 2023

No! - I probably should have looked! Do you know off the top of your head?

@NickCrews (Contributor) commented:

I'll tag @fgregg, in case they want to be generous enough to weigh in on some of the lessons they learned writing dedupe. Any thoughts or tips would be appreciated!

It's quite interesting. It requires labeled pairs, so the algorithm knows which pairs are true matches, but perhaps we could do an iterative process: run EM to get some guesses at matches, then learn some blocking rules, then repeat. Or at least there is some inspiration there that we could pull out.

First, read the high level description.

If you look more at dedupe, it can be a bit confusing because there are two layers to the block learner. The core one, in training.py, is used during the final training. But there is also an active learning phase in dedupe, where it has to decide which pairs to present to a human to label; that phase has its own block learner in labeler.py, which wraps the underlying one. For our purposes, ignore labeler.py and just look at the core one in training.py.

Now that I'm looking again, I don't see very good docstrings, so this is mostly from memory from when I looked at this hard in the past, and I might not have it exactly right. But as I remember it, the basic algorithm is:

  • Pass a set of candidate atomic blocking rules (e.g. l.name = r.name), the total dataset, and a third param that I don't quite understand into DedupeBlockLearner.__init__().
  • In DedupeBlockLearner.learn(), pass a collection of true-match pairs (from the manual labeling process) and a recall fraction between 0 and 1. The other two params are just configuration: index_predicates is a flag for whether to use a reverse-index data structure to look up records from a blocking key, e.g. "find all records that contain the substring 'abc'" (this is expensive, so they feature-flag it). The candidate_types flag controls how compound blocking rules are constructed from the atomic rules you passed in. I think it uses the "simple" (fast) mode during the labeling phase, when things need to be fast, and "random forest" during the final training phase, when it can be slower but more accurate. It looks like in this PR you are sort of headed in the random-forest direction.
  • Then the learner generates compound candidate rules from the atomic rules using one of the above strategies.
  • For each rule, it looks at (1) which true matches it covers and (2) which false matches it covers.
  • So now we have an optimization problem: select a subset of blocking rules that covers at least a recall fraction of the true matches while including the smallest number of non-matches. They claim to use Chvátal's greedy set-cover algorithm, but I didn't actually look at their implementation (see the sketch after this list).
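
For reference, a minimal sketch of a greedy weighted set-cover selection along those lines (not dedupe's actual code; the data structures and the simple cost ratio here are assumptions): repeatedly pick the rule with the best ratio of newly covered true matches to non-matches included, until the recall target is reached.

```python
def greedy_cover(rules, true_matches, recall):
    """rules: dict mapping rule name -> (covered_matches, covered_nonmatches),
    both sets of pair ids. Select rules until `recall` of the true matches is
    covered, greedily preferring rules that add many matches per non-match.
    """
    target = recall * len(true_matches)
    covered, selected = set(), []
    remaining = dict(rules)

    while len(covered) < target and remaining:
        def score(item):
            name, (matches, nonmatches) = item
            new = len(matches - covered)
            return new / (1 + len(nonmatches))  # matches gained per non-match cost

        best_name, (matches, _) = max(remaining.items(), key=score)
        if not (matches - covered):
            break  # no remaining rule adds new coverage
        selected.append(best_name)
        covered |= matches
        del remaining[best_name]

    return selected
```

Chvátal's algorithm is essentially this pattern, with the cost of a rule being the non-matching pairs it introduces.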

There are many great issues where the authors talk about their experiments and thoughts; we should add links here if we find them.

@RobinL (Member, Author) commented Jul 27, 2023

Thanks - that is super useful; I really appreciate you taking the time. I will read it properly when I find time to continue work on this PR.

@RobinL (Member, Author) commented Jul 28, 2023

@NickCrews, if you're interested, I've updated the PR description (first message above) with the current strategy I'm using. I'm sure there's room for improvement, but in practice this seems to be giving pretty good results already.
