Enable salting for EM training #1832

Merged
merged 3 commits into from
Jan 12, 2024

Conversation

RobinL
Member

@RobinL RobinL commented Jan 10, 2024

This PR enables salting for EM training, addressing #1817.

This was previously disallowed, somewhat arbitrarily, on the basis that it probably wasn't needed. There's no real reason to disable it, though, since it's opt-in (salting is off by default).
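
For readers unfamiliar with salting, here is a minimal sketch of the general idea: one blocking rule is split into several disjoint sub-rules, each covering a slice of a random salt value, so an engine can process the slices independently. The function name `salted_rules` and the column name `salt` are hypothetical, chosen for illustration only; this is not Splink's internal implementation.

```python
def salted_rules(blocking_rule_sql: str, salting_partitions: int) -> list[str]:
    """Split one blocking rule into `salting_partitions` disjoint sub-rules.

    Assumes (hypothetically) that every left-hand record carries a column
    `salt` holding a uniform random number in (0, 1].
    """
    if salting_partitions < 1:
        raise ValueError("salting_partitions must be at least 1")
    if salting_partitions == 1:
        # No salting requested: the rule is used unchanged.
        return [blocking_rule_sql]
    # ceiling(salt * n) maps each record to exactly one bucket in 1..n,
    # so the sub-rules together cover the same pairs as the original rule.
    return [
        f"({blocking_rule_sql}) AND ceiling(l.salt * {salting_partitions}) = {k}"
        for k in range(1, salting_partitions + 1)
    ]


rules = salted_rules("l.first_name = r.first_name", 3)
for rule in rules:
    print(rule)
```

Because the default is a single partition, a rule without salting passes through untouched, which is why enabling the option for EM training does not change existing behaviour.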

I thought this might help with parallelisation in DuckDB, but I have checked and salting makes EM training neither slower nor noticeably faster. That's because the work is almost fully parallelised already:

  • First, we create the comparison vectors as a table once (not in parallel)
  • Then we use this table repeatedly to iterate the EM parameters. Because this is a large table with millions of rows, aggregations on it are parallelised by default
  • These aggregations comprise most of the work
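
To illustrate why each EM iteration boils down to aggregations over the comparison vector table, here is a rough pure-Python sketch of one iteration of the standard Fellegi-Sunter EM update. It is an illustration only: Splink runs the equivalent logic as SQL, and the function name `em_iteration` and the toy data are assumptions made for this example.

```python
from collections import Counter


def em_iteration(gammas, lam, m, u):
    """One EM iteration over comparison vectors.

    gammas: list of tuples; entry c is the agreement level of comparison c.
    lam:    current probability that two random records match.
    m, u:   m[c][level] and u[c][level] are the current m and u probabilities.
    """
    counts = Counter(gammas)  # aggregation 1: group identical vectors
    n = len(gammas)
    total_match = 0.0
    m_num = [[0.0] * len(row) for row in m]
    u_num = [[0.0] * len(row) for row in u]
    for vec, cnt in counts.items():
        # E-step: posterior match probability for this comparison vector.
        p_match, p_non = lam, 1.0 - lam
        for c, level in enumerate(vec):
            p_match *= m[c][level]
            p_non *= u[c][level]
        w = p_match / (p_match + p_non)
        # M-step inputs: weighted sums per comparison level (aggregation 2).
        total_match += cnt * w
        for c, level in enumerate(vec):
            m_num[c][level] += cnt * w
            u_num[c][level] += cnt * (1.0 - w)
    total_non = n - total_match
    new_m = [[x / total_match for x in row] for row in m_num]
    new_u = [[x / total_non for x in row] for row in u_num]
    return total_match / n, new_m, new_u


# Toy data: two binary comparisons (0 = no match, 1 = exact match).
gammas = [(1, 1)] * 20 + [(0, 0)] * 70 + [(1, 0)] * 6 + [(0, 1)] * 4
new_lam, new_m, new_u = em_iteration(
    gammas,
    lam=0.1,
    m=[[0.1, 0.9], [0.1, 0.9]],
    u=[[0.9, 0.1], [0.9, 0.1]],
)
```

Every step after the table is built is a grouped sum, which is the kind of work a query engine can spread across threads without any help from salting.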
test code
from IPython.display import display
import time
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.historical_50k
df.head()

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"]),
        block_on(["surname"]),
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("birth_place"),
        exact_match("postcode_fake"),
    ],
    "retain_intermediate_calculation_columns": True,
    "max_iterations": 10,
    "em_convergence": 0.01,
}


linker = DuckDBLinker(df, settings)


# Baseline: EM training with an unsalted blocking rule
br = block_on(["first_name"])

start_time = time.time()
linker.estimate_parameters_using_expectation_maximisation(br)
end_time = time.time()
print(f"Time taken: {end_time - start_time:.1f} seconds")

# Same run with salting; a fresh linker so both runs start from the same settings
linker = DuckDBLinker(df, settings)
br = block_on(["first_name"], salting_partitions=12)

start_time = time.time()
linker.estimate_parameters_using_expectation_maximisation(br)
end_time = time.time()
print(f"Time taken: {end_time - start_time:.1f} seconds")

@RobinL RobinL changed the title Parallel em training Enable salting for EM trainin Jan 10, 2024
@RobinL RobinL changed the title Enable salting for EM trainin Enable salting for EM training Jan 10, 2024
Contributor

@ADBond ADBond left a comment

Great, makes sense 👍

@RobinL RobinL merged commit 3e74e95 into master Jan 12, 2024
8 of 10 checks passed
@RobinL RobinL deleted the parallel_em_training branch January 12, 2024 10:19