Automatically detect blocking rules for prediction and blocking rules for EM training #1668

Merged: 9 commits into master from optimise_cost_of_brs, Nov 22, 2023

Conversation

@RobinL (Member) commented Oct 24, 2023

This is the third PR in a series that will establish two new functions:

linker._detect_blocking_rules_for_prediction(
    max_comparisons_per_rule=1e3, min_freedom=1, 
)
linker._detect_blocking_rules_for_em_training(
    max_comparisons_per_rule=1e2, min_freedom=0
)

We have decided to leave these functions private (prefixed with _) for now, so we can test them out before introducing them to the public API.

Links to other PRs:

  1. Find blocking rules returning a comparison count below a threshold
  2. Compute the cost of combinations of blocking rules

In this third PR, I use the functions in the previous two to automatically detect good blocking rules.

The input (from the second PR) is a table like this:

| blocking_rules | comparison_count | complexity |
| --- | --- | --- |
| first_name, surname | 533,375 | 2 |
| first_name, dob | 174,651 | 2 |
| first_name, birth_place | 344,971 | 2 |
| first_name, postcode_fake | 149,105 | 2 |
| first_name, gender, surname | 373,354 | 3 |

etc.
(Note: the actual table is one-hot encoded, like this:)
| | blocking_rules | comparison_count | complexity | `__fixed__first_name` | `__fixed__surname` | `__fixed__dob` | `__fixed__birth_place` | `__fixed__postcode_fake` | `__fixed__gender` | `__fixed__occupation` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | first_name, surname | 533,375 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | first_name, dob | 174,651 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | first_name, birth_place | 344,971 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | first_name, postcode_fake | 149,105 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | first_name, gender, surname | 373,354 | 3 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
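
As a rough illustration (not Splink's internal code), a one-hot table like this can be built from the blocking_rules strings with pandas; the rows below are just the example values from the table above:

```python
import pandas as pd

# Example rows copied from the table above (illustrative only)
df = pd.DataFrame(
    {
        "blocking_rules": [
            "first_name, surname",
            "first_name, dob",
            "first_name, birth_place",
            "first_name, postcode_fake",
            "first_name, gender, surname",
        ],
        "comparison_count": [533_375, 174_651, 344_971, 149_105, 373_354],
    }
)

# Complexity = number of fields held fixed by the rule
df["complexity"] = df["blocking_rules"].str.count(",") + 1

# One indicator column per field, prefixed with __fixed__
# (column ordering may differ from Splink's actual output)
one_hot = df["blocking_rules"].str.get_dummies(sep=", ").add_prefix("__fixed__")
df = pd.concat([df, one_hot], axis=1)
```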

Once we have this table, we want an algorithm that finds combinations of rows satisfying certain criteria.

Specifically, we're looking for an algorithm that satisfies the following constraint:

  • Across the rows selected, every column should be allowed to vary in at least n of the blocking rules.

Amongst solutions that satisfy that constraint, the algorithm should minimise the following cost function:

  • Have a strong preference for selecting 'lower-complexity' rows (fewer variables being blocked on).
  • Have a strong preference for allowing each field to vary as many times as possible, i.e. fix each individual field as few times as possible (ideally, don't hold the same field fixed more than once); see the sketch after this list.
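
To make the constraint and cost concrete, here is a minimal sketch of how they could be evaluated against the one-hot table. The helper names, weights and row representation (each row as a dict keyed by the `__fixed__` columns plus `complexity`) are my own illustration, not Splink's implementation:

```python
def meets_min_freedom(selected_rows, fixed_cols, min_freedom):
    # A field "varies" in a rule when its __fixed__ flag is 0 for that row.
    # Every field must be free to vary in at least `min_freedom` of the rules.
    for col in fixed_cols:
        times_varying = sum(1 - row[col] for row in selected_rows)
        if times_varying < min_freedom:
            return False
    return True


def cost(selected_rows, fixed_cols, complexity_weight=10.0, repeat_fix_weight=5.0):
    # Strong preference for low-complexity rules...
    total = sum(complexity_weight * row["complexity"] for row in selected_rows)
    # ...and for not holding the same field fixed more than once
    for col in fixed_cols:
        times_fixed = sum(row[col] for row in selected_rows)
        if times_fixed > 1:
            total += repeat_fix_weight * (times_fixed - 1)
    return total
```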

At the moment the implementation is heuristic: it picks somewhat random rows from a list sorted by complexity until the main 'must vary in at least n' constraint is met, evaluates the cost function, and then repeats this process many times, keeping the selection with the lowest cost.

It's fast to do this a large number of times, so I don't think there's any need to use a heavyweight tool like a SAT solver.
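
For concreteness, here is a minimal illustration of that randomised search, reusing the hypothetical helpers sketched above; Splink's actual implementation in splink/optimise_cost_of_brs.py differs in its details:

```python
import random


def detect_blocking_rule_combination(rows, fixed_cols, min_freedom, n_trials=1000, seed=0):
    rng = random.Random(seed)
    rows = sorted(rows, key=lambda r: r["complexity"])
    best_cost, best_selection = float("inf"), None
    for _ in range(n_trials):
        selected, remaining = [], rows.copy()
        # Keep adding rows, biased towards the low-complexity end of the
        # sorted list, until the 'must vary in at least n' constraint is met
        while remaining and not (
            selected and meets_min_freedom(selected, fixed_cols, min_freedom)
        ):
            weights = [1 / (i + 1) for i in range(len(remaining))]
            pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
            selected.append(remaining.pop(pick))
        if not meets_min_freedom(selected, fixed_cols, min_freedom):
            continue  # this trial never satisfied the constraint
        trial_cost = cost(selected, fixed_cols)
        if trial_cost < best_cost:
            best_cost, best_selection = trial_cost, selected
    return best_selection
```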

Test script:

from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker


# Small fake_1000 dataset (this first linker is overwritten below)
df = splink_datasets.fake_1000

settings = {
    "link_type": "dedupe_only",
}
linker = DuckDBLinker(df, settings)

# Larger historical_50k dataset used for the actual run
df = splink_datasets.historical_50k

settings = {
    "link_type": "dedupe_only",
    "additional_columns_to_retain": ["cluster", "full_name", "first_and_surname"],
}
linker = DuckDBLinker(df, settings)
linker._detect_blocking_rules_for_prediction(1e6, min_freedom=2)
linker.cumulative_num_comparisons_from_blocking_rules_chart()
linker._detect_blocking_rules_for_em_training(1e6, min_freedom=1)

Output:

The following blocking_rules_to_generate_predictions were automatically detected and assigned to your settings:
block_on("birth_place", "occupation") 
block_on("first_name", "dob") 
block_on("gender", "postcode_fake")
The following EM training strategy was detected:
linker.estimate_parameters_using_expectation_maximisation(block_on("gender", "postcode_fake")) 
linker.estimate_parameters_using_expectation_maximisation(block_on("surname", "occupation"))

Spark test script:

from pyspark.context import SparkConf, SparkContext
from pyspark.sql import SparkSession, types

from splink.spark.jar_location import similarity_jar_location
from splink.spark.spark_linker import SparkLinker

path = similarity_jar_location()

conf = SparkConf()
conf.set("spark.jars", path)
conf.set("spark.driver.memory", "12g")
conf.set("spark.sql.shuffle.partitions", "12")

sc = SparkContext.getOrCreate(conf=conf)
sc.setCheckpointDir("tmp_checkpoints/")
spark = SparkSession(sc)

df = spark.read.parquet(
    "/Users/robinlinacre/Documents/data_linking/splink/synthetic_1m_clean.parquet"
)
df = df.repartition(8)

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
}

linker = SparkLinker(df, settings)

linker._detect_blocking_rules_for_prediction(1e8, min_freedom=1)

@RobinL changed the base branch from master to calculate_cost_of_brs October 24, 2023 16:56
Base automatically changed from calculate_cost_of_brs to master November 20, 2023 11:20
@ThomasHepworth (Contributor) commented Nov 20, 2023

The code currently breaks when 1=1 is selected as the blocking rule. This is simply because we are registering it here as a string, rather than a BlockingRule.

At present, we convert any string BRs -> BlockingRules in the settings.py script, so we can later dynamically assess if salting is needed.

I would either:

  1. Return `block_on("1")` instead of `1=1`
  2. Use `blocking_rule_to_obj` to convert any outputs from these here (see the sketch below).
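
For option 2, the idea would be something like the sketch below; the import path is my assumption about where `blocking_rule_to_obj` lives, and the detected rules are placeholder strings:

```python
# Sketch of option 2: normalise any string rules (including "1=1") into
# BlockingRule objects before they are assigned to the settings.
# The import path is an assumption; adjust to wherever blocking_rule_to_obj lives.
from splink.blocking import blocking_rule_to_obj

detected_rules = ["1=1", 'l."first_name" = r."first_name"']  # placeholder examples
blocking_rule_objects = [blocking_rule_to_obj(br) for br in detected_rules]
```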
Demo of break...
from splink.duckdb.duckdb_linker import DuckDBLinker
import pandas as pd

from tests.basic_settings import get_settings_dict

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

linker = DuckDBLinker(
    df,
    get_settings_dict(),
)
linker._detect_blocking_rules_for_prediction(1e6, min_freedom=2)

linker.predict()

@ThomasHepworth (Contributor) left a comment:


I'll double check this in the morning when I'm fresh, but this looks fantastic.

Inline review comments on splink/optimise_cost_of_brs.py and splink/linker.py were resolved.
@ThomasHepworth (Contributor) left a comment:


I think this looks great! Most of the complicated logic seems to be from the previous two PRs, so I think I'm basically happy to tick this off once my comments have been addressed.

RobinL and others added 2 commits November 22, 2023 10:34
Co-authored-by: Tom Hepworth <45356472+ThomasHepworth@users.noreply.github.com>
@ThomasHepworth (Contributor) commented:
Thanks Robin. If you get the tests to pass then I'm happy to approve this.

@RobinL (Member, Author) commented Nov 22, 2023

> The code currently breaks when 1=1 is selected as the blocking rule. This is simply because we are registering it here as a string, rather than a BlockingRule.
>
> At present, we convert any string BRs -> BlockingRules in the settings.py script, so we can later dynamically assess if salting is needed.
>
> I would either:
>
>   1. Return `block_on("1")` instead of `1=1`
>   2. Use `blocking_rule_to_obj` to convert any outputs from these here.
>
> Demo of break...

great spot, fixed here

@ThomasHepworth (Contributor) left a comment:


Thanks Robin, happy for you to merge this into master now.

I know you've already mentioned this to the team, but it's probably worth bringing it up again and seeing if someone in the team can crash test this on a live pipeline.

@RobinL merged commit 42de8da into master Nov 22, 2023
8 checks passed
@RobinL deleted the optimise_cost_of_brs branch November 22, 2023 11:40