Enable salting for EM training #1832

RobinL · 2024-01-10T14:21:52Z

This PR enables the use of salting for EM training, addressing #1817.

This was previously disallowed, somewhat arbitrarily, on the basis that it probably wasn't needed. But there's no real reason to disable it, since it's opt-in (by default salting is not used).

I thought this might help with parallelisation in DuckDB, but EM training is already almost fully parallelised. I have checked, and salting doesn't make it any slower, but not really any faster either. It's almost fully parallelised already because:

- First we create the comparison vectors as a table once (not in parallel)
- Then we use this table repeatedly to iterate the EM parameters. Because this is a large table with millions of rows, aggregations on it are parallelised by default
- These aggregations comprise most of the work
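To illustrate the "build the table once, then aggregate repeatedly" pattern, here is a minimal pure-Python sketch of EM over a fixed set of comparison vectors. This is not Splink's implementation (Splink does this in SQL); the function name `em_on_comparison_vectors` and the toy data are made up for illustration.

```python
from collections import Counter


def em_on_comparison_vectors(gammas, lam, m, u, max_iterations=10, em_convergence=0.01):
    """Hypothetical helper: gammas is a list of tuples of comparison levels,
    m[col][level] and u[col][level] are the starting probabilities."""
    # Build the "table" once: count identical comparison vectors.
    counts = Counter(gammas)
    n_pairs = len(gammas)

    for _ in range(max_iterations):
        # E-step: match probability for each distinct comparison vector.
        match_prob = {}
        for gamma in counts:
            p_match, p_non_match = lam, 1.0 - lam
            for col, level in enumerate(gamma):
                p_match *= m[col][level]
                p_non_match *= u[col][level]
            match_prob[gamma] = p_match / (p_match + p_non_match)

        # M-step: nothing but weighted sums (aggregations) over the same table.
        new_m = [[0.0] * len(levels) for levels in m]
        new_u = [[0.0] * len(levels) for levels in u]
        expected_matches = 0.0
        for gamma, count in counts.items():
            w = match_prob[gamma] * count
            expected_matches += w
            for col, level in enumerate(gamma):
                new_m[col][level] += w
                new_u[col][level] += count - w
        expected_non_matches = n_pairs - expected_matches
        new_m = [[v / expected_matches for v in levels] for levels in new_m]
        new_u = [[v / expected_non_matches for v in levels] for levels in new_u]
        new_lam = expected_matches / n_pairs

        # Stop once the largest change in any m probability is small enough.
        change = max(abs(a - b) for old, new in zip(m, new_m) for a, b in zip(old, new))
        lam, m, u = new_lam, new_m, new_u
        if change < em_convergence:
            break
    return lam, m, u


# Toy data: two comparison columns, levels 0 (no match) and 1 (exact match).
gammas = [(1, 1)] * 30 + [(0, 0)] * 60 + [(1, 0)] * 5 + [(0, 1)] * 5
lam, m, u = em_on_comparison_vectors(
    gammas, lam=0.3, m=[[0.1, 0.9], [0.1, 0.9]], u=[[0.9, 0.1], [0.9, 0.1]]
)
```

The only expensive per-iteration step is the pass over the table of comparison vectors, which is the part a database engine can parallelise without any help from salting.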

Test code:

from IPython.display import display
import time
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.historical_50k
df.head()

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"]),
        block_on(["surname"]),
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("birth_place"),
        exact_match("postcode_fake"),
    ],
    "retain_intermediate_calculation_columns": True,
    "max_iterations": 10,
    "em_convergence": 0.01,
}


linker = DuckDBLinker(df, settings)


br = block_on("first_name")

start_time = time.time()
linker.estimate_parameters_using_expectation_maximisation(br)
end_time = time.time()
print(f"Time taken: {end_time - start_time:.1f} seconds")

linker = DuckDBLinker(df, settings)
br = block_on("first_name", salting_partitions=12)

start_time = time.time()
linker.estimate_parameters_using_expectation_maximisation(br)
end_time = time.time()
print(f"Time taken: {end_time - start_time:.1f} seconds")
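For anyone unfamiliar with what `salting_partitions` does conceptually, a rough sketch follows. It assumes salting works by giving each record a random salt and splitting one blocking rule into several disjoint rules, each handling one salt bucket. The function `blocked_pairs` and the toy records are hypothetical, and the exact SQL Splink generates may differ.

```python
import random


def blocked_pairs(records, key, salting_partitions=1, seed=0):
    """Hypothetical helper: return one set of candidate pairs per salt partition."""
    rng = random.Random(seed)
    # Each record gets a random salt in [0, 1).
    salts = {r["unique_id"]: rng.random() for r in records}
    partitions = []
    for k in range(salting_partitions):
        # Same blocking condition as the unsalted rule, plus a condition
        # on the salt bucket of the left-hand record.
        pairs = {
            (l["unique_id"], r["unique_id"])
            for l in records
            for r in records
            if l["unique_id"] < r["unique_id"]
            and l[key] == r[key]
            and int(salts[l["unique_id"]] * salting_partitions) == k
        }
        partitions.append(pairs)
    return partitions


names = ["john", "john", "mary", "john", "mary", "ann"]
records = [{"unique_id": i, "first_name": name} for i, name in enumerate(names)]

unsalted = blocked_pairs(records, "first_name")[0]
salted = blocked_pairs(records, "first_name", salting_partitions=3)
```

The partitions are disjoint and their union is exactly the unsalted set of pairs, so salting changes how the work is split up but not which comparisons are made. That is consistent with the timings above being roughly the same either way.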


ADBond

Great, makes sense 👍

RobinL added 2 commits January 10, 2024 14:21

Refactor blocking rule initialization in EMTrainingSession and Linker…

5c07616

… classes

Merge branch 'master' into parallel_em_training

87fc2de

RobinL changed the title ~~Parallel em training~~ Enable salting for EM trainin Jan 10, 2024

update changelog

886a78e

RobinL changed the title ~~Enable salting for EM trainin~~ Enable salting for EM training Jan 10, 2024

ADBond approved these changes Jan 12, 2024


RobinL merged commit 3e74e95 into master Jan 12, 2024
8 of 10 checks passed

RobinL deleted the parallel_em_training branch January 12, 2024 10:19

This was referenced Jan 17, 2024

EM silently ignores salting #1817

Closed

Add support for SaltedBlockingRule for EM training (again) #1853

Merged
