Automatically detect blocking rules for prediction and blocking rules for EM training #1668

Merged: 9 commits into master from optimise_cost_of_brs, Nov 22, 2023

Conversation

@RobinL (Member) commented Oct 24, 2023

This is the third PR in a series that will establish two new functions:

linker._detect_blocking_rules_for_prediction(
    max_comparisons_per_rule=1e3, min_freedom=1, 
)
linker._detect_blocking_rules_for_em_training(
    max_comparisons_per_rule=1e2, min_freedom=0
)

We have decided to leave these functions private (prefixed with _) for now, so we can test them out before introducing them to the public API.

Links to other PRs:

  1. Find blocking rules returning a comparison count below a threshold
  2. Compute the cost of combinations of blocking rules

In this third PR, I use the functions in the previous two to automatically detect good blocking rules.

The input (from the second PR) is a table like this:

| blocking_rules | comparison_count | complexity |
| --- | --- | --- |
| first_name, surname | 533,375 | 2 |
| first_name, dob | 174,651 | 2 |
| first_name, birth_place | 344,971 | 2 |
| first_name, postcode_fake | 149,105 | 2 |
| first_name, gender, surname | 373,354 | 3 |

etc.
(Note: the actual table is one-hot encoded, like this:)
| | blocking_rules | comparison_count | complexity | `__fixed__first_name` | `__fixed__surname` | `__fixed__dob` | `__fixed__birth_place` | `__fixed__postcode_fake` | `__fixed__gender` | `__fixed__occupation` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | first_name, surname | 533,375 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | first_name, dob | 174,651 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | first_name, birth_place | 344,971 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | first_name, postcode_fake | 149,105 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | first_name, gender, surname | 373,354 | 3 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
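
As a rough illustration (not Splink's internal code), a one-hot table like this can be built from the blocking_rules strings with pandas; the rows below are just the example values from the table above:

```python
import pandas as pd

# Example rows copied from the table above (illustrative only)
df = pd.DataFrame(
    {
        "blocking_rules": [
            "first_name, surname",
            "first_name, dob",
            "first_name, birth_place",
            "first_name, postcode_fake",
            "first_name, gender, surname",
        ],
        "comparison_count": [533_375, 174_651, 344_971, 149_105, 373_354],
    }
)

# Complexity = number of fields held fixed by the rule
df["complexity"] = df["blocking_rules"].str.count(",") + 1

# One indicator column per field, prefixed with __fixed__
# (column ordering may differ from Splink's actual output)
one_hot = df["blocking_rules"].str.get_dummies(sep=", ").add_prefix("__fixed__")
df = pd.concat([df, one_hot], axis=1)
```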

Once we have this table, we want an algorithm that finds combinations of rows satisfying certain criteria.

Specifically, we're looking for an algorithm that satisfies the following constraint:

  • Across the rows selected, every column should be allowed to vary in at least n of the blocking rules.

Amongst solutions that satisfy that constraint, the algorithm should minimise the following cost function:

  • Have a strong preference for selecting 'lower-complexity' rows (fewer variables being blocked on).
  • Have a strong preference for allowing each field to vary as many times as possible, i.e. fix each individual field as few times as possible (ideally, don't hold the same field fixed more than once); see the sketch after this list.
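
To make the constraint and cost concrete, here is a minimal sketch of how they could be evaluated against the one-hot table. The helper names, weights and row representation (each row as a dict keyed by the `__fixed__` columns plus `complexity`) are my own illustration, not Splink's implementation:

```python
def meets_min_freedom(selected_rows, fixed_cols, min_freedom):
    # A field "varies" in a rule when its __fixed__ flag is 0 for that row.
    # Every field must be free to vary in at least `min_freedom` of the rules.
    for col in fixed_cols:
        times_varying = sum(1 - row[col] for row in selected_rows)
        if times_varying < min_freedom:
            return False
    return True


def cost(selected_rows, fixed_cols, complexity_weight=10.0, repeat_fix_weight=5.0):
    # Strong preference for low-complexity rules...
    total = sum(complexity_weight * row["complexity"] for row in selected_rows)
    # ...and for not holding the same field fixed more than once
    for col in fixed_cols:
        times_fixed = sum(row[col] for row in selected_rows)
        if times_fixed > 1:
            total += repeat_fix_weight * (times_fixed - 1)
    return total
```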

At the moment the implementation is heuristic: it picks somewhat random rows from a list sorted by complexity until the main 'must vary in at least n' constraint is met, evaluates the cost function, and then repeats this process many times, keeping the selection with the lowest cost.

It's fast to do this a large number of times, so I don't think there's any need to use a heavyweight tool like a SAT solver.
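
For concreteness, here is a minimal illustration of that randomised search, reusing the hypothetical helpers sketched above; Splink's actual implementation in splink/optimise_cost_of_brs.py differs in its details:

```python
import random


def detect_blocking_rule_combination(rows, fixed_cols, min_freedom, n_trials=1000, seed=0):
    rng = random.Random(seed)
    rows = sorted(rows, key=lambda r: r["complexity"])
    best_cost, best_selection = float("inf"), None
    for _ in range(n_trials):
        selected, remaining = [], rows.copy()
        # Keep adding rows, biased towards the low-complexity end of the
        # sorted list, until the 'must vary in at least n' constraint is met
        while remaining and not (
            selected and meets_min_freedom(selected, fixed_cols, min_freedom)
        ):
            weights = [1 / (i + 1) for i in range(len(remaining))]
            pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
            selected.append(remaining.pop(pick))
        if not meets_min_freedom(selected, fixed_cols, min_freedom):
            continue  # this trial never satisfied the constraint
        trial_cost = cost(selected, fixed_cols)
        if trial_cost < best_cost:
            best_cost, best_selection = trial_cost, selected
    return best_selection
```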

Test script:

from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker


# Small fake_1000 dataset (this first linker is overwritten below)
df = splink_datasets.fake_1000

settings = {
    "link_type": "dedupe_only",
}
linker = DuckDBLinker(df, settings)

# Larger historical_50k dataset used for the actual run
df = splink_datasets.historical_50k

settings = {
    "link_type": "dedupe_only",
    "additional_columns_to_retain": ["cluster", "full_name", "first_and_surname"],
}
linker = DuckDBLinker(df, settings)
linker._detect_blocking_rules_for_prediction(1e6, min_freedom=2)
linker.cumulative_num_comparisons_from_blocking_rules_chart()
linker._detect_blocking_rules_for_em_training(1e6, min_freedom=1)

Output:

The following blocking_rules_to_generate_predictions were automatically detected and assigned to your settings:
block_on("birth_place", "occupation") 
block_on("first_name", "dob") 
block_on("gender", "postcode_fake")
The following EM training strategy was detected:
linker.estimate_parameters_using_expectation_maximisation(block_on("gender", "postcode_fake")) 
linker.estimate_parameters_using_expectation_maximisation(block_on("surname", "occupation"))

Spark test script:

from pyspark.context import SparkConf, SparkContext
from pyspark.sql import SparkSession, types

from splink.spark.jar_location import similarity_jar_location
from splink.spark.spark_linker import SparkLinker

path = similarity_jar_location()

conf = SparkConf()
conf.set("spark.jars", path)
conf.set("spark.driver.memory", "12g")
conf.set("spark.sql.shuffle.partitions", "12")

sc = SparkContext.getOrCreate(conf=conf)
sc.setCheckpointDir("tmp_checkpoints/")
spark = SparkSession(sc)

df = spark.read.parquet(
    "/Users/robinlinacre/Documents/data_linking/splink/synthetic_1m_clean.parquet"
)
df = df.repartition(8)

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
}

linker = SparkLinker(df, settings)

linker._detect_blocking_rules_for_prediction(1e8, min_freedom=1)

@RobinL changed the base branch from master to calculate_cost_of_brs October 24, 2023 16:56
Base automatically changed from calculate_cost_of_brs to master November 20, 2023 11:20
@ThomasHepworth (Contributor) commented Nov 20, 2023

The code currently breaks when 1=1 is selected as the blocking rule. This is simply because we are registering it here as a string, rather than a BlockingRule.

At present, we convert any string BRs -> BlockingRules in the settings.py script, so we can later dynamically assess if salting is needed.

I would either:

  1. Return `block_on("1")` instead of `1=1`
  2. Use `blocking_rule_to_obj` to convert any outputs from these here (see the sketch below).
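
For option 2, the idea would be something like the sketch below; the import path is my assumption about where `blocking_rule_to_obj` lives, and the detected rules are placeholder strings:

```python
# Sketch of option 2: normalise any string rules (including "1=1") into
# BlockingRule objects before they are assigned to the settings.
# The import path is an assumption; adjust to wherever blocking_rule_to_obj lives.
from splink.blocking import blocking_rule_to_obj

detected_rules = ["1=1", 'l."first_name" = r."first_name"']  # placeholder examples
blocking_rule_objects = [blocking_rule_to_obj(br) for br in detected_rules]
```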
Demo of break...
from splink.duckdb.duckdb_linker import DuckDBLinker
import pandas as pd

from tests.basic_settings import get_settings_dict

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

linker = DuckDBLinker(
    df,
    get_settings_dict(),
)
linker._detect_blocking_rules_for_prediction(1e6, min_freedom=2)

linker.predict()

@ThomasHepworth (Contributor) left a comment:


I'll double check this in the morning when I'm fresh, but this looks fantastic.

Inline review comments on splink/optimise_cost_of_brs.py and splink/linker.py were resolved.
@ThomasHepworth (Contributor) left a comment:


I think this looks great! Most of the complicated logic seems to be from the previous two PRs, so I think I'm basically happy to tick this off once my comments have been addressed.

RobinL and others added 2 commits November 22, 2023 10:34
Co-authored-by: Tom Hepworth <45356472+ThomasHepworth@users.noreply.github.com>
@ThomasHepworth (Contributor) commented:
Thanks Robin. If you get the tests to pass then I'm happy to approve this.

@RobinL (Member, Author) commented Nov 22, 2023

> The code currently breaks when 1=1 is selected as the blocking rule. This is simply because we are registering it here as a string, rather than a BlockingRule.
>
> At present, we convert any string BRs -> BlockingRules in the settings.py script, so we can later dynamically assess if salting is needed.
>
> I would either:
>
>   1. Return `block_on("1")` instead of `1=1`
>   2. Use `blocking_rule_to_obj` to convert any outputs from these here.
>
> Demo of break...

great spot, fixed here

@ThomasHepworth (Contributor) left a comment:


Thanks Robin, happy for you to merge this into master now.

I know you've already mentioned this to the team, but it's probably worth bringing it up again and seeing if someone in the team can crash test this on a live pipeline.

@RobinL merged commit 42de8da into master Nov 22, 2023
8 checks passed
@RobinL deleted the optimise_cost_of_brs branch November 22, 2023 11:40