[FEAT] Detect equi-join conditions in a blocking rule to count the number of comparisons without needing to perform the join #1388
Conversation
Test: test_2_rounds_1k_duckdb. Percentage change: -26.1%
Test: test_2_rounds_1k_sqlite. Percentage change: -24.1%
As expected, the new methodology is dramatically quicker, especially on loose blocking conditions (which are the ones that are most problematic and that we most need to be able to analyse easily):

Test script:

import datetime
import pandas as pd
from splink.duckdb.linker import DuckDBLinker

df = pd.read_parquet(
    "/Users/robinlinacre/Documents/data_linking/splink/synthetic_1m_clean.parquet"
)
blocking_rules = [
"l.first_name = r.first_name",
"l.first_name = r.first_name AND l.surname = r.surname",
"substr(l.first_name,2,3) = substr(r.first_name,3,4)",
"substr(l.first_name,1,3) = substr(r.first_name,1,3)",
]
settings = {"link_type": "dedupe_only"}
linker = DuckDBLinker(df, settings)
for br in blocking_rules:
    print("-------")
    print(br)
    start_time = datetime.datetime.now()
    print(linker.count_num_comparisons_from_blocking_rule_2(br))
    end_time = datetime.datetime.now()
    time_new = end_time - start_time
    print("Time taken for count_num_comparisons_from_blocking_rule_2: ", time_new)
    start_time = datetime.datetime.now()
    print(linker.count_num_comparisons_from_blocking_rule(br))
    end_time = datetime.datetime.now()
    time_old = end_time - start_time
    print(
        "Time taken for count_num_comparisons_from_blocking_rule: ",
        time_old,
    )
    print(f"Speed multiplier: {(time_old/time_new):,.1f}")

Output:
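For anyone following the conversation, here is a minimal sketch of why the new method avoids the join (an illustration only, not Splink's exact SQL). For an equi-join blocking rule such as l.first_name = r.first_name, the number of pairwise comparisons generated is the sum, over each distinct key value, of count_l * count_r, and a GROUP BY can compute that without ever materialising the joined pairs:

import duckdb
import pandas as pd

# Toy input: "ann" appears twice, contributing 2 * 2 = 4 pairs to the
# pre-filter count; "bob" and "cat" each contribute 1, giving 6 in total.
df = pd.DataFrame({"first_name": ["ann", "ann", "bob", "cat"]})

sql = """
SELECT SUM(frequency * frequency) AS count_of_pairwise_comparisons_generated
FROM (
    SELECT COUNT(*) AS frequency
    FROM df
    GROUP BY first_name
)
"""
print(duckdb.sql(sql).fetchone()[0])  # 6

A multi-column rule such as l.first_name = r.first_name AND l.surname = r.surname works the same way by grouping on both columns at once; where the two sides of the equality are different expressions, as in substr(l.first_name,2,3) = substr(r.first_name,3,4), the frequency of each side's expression can be computed separately and the products summed over matching keys.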
Spark test script (this also exercises backtick-quoted column names containing a space):

from pyspark.context import SparkConf, SparkContext
from pyspark.sql import SparkSession
from splink.spark.linker import SparkLinker
conf = SparkConf()
conf.set("spark.driver.memory", "12g")
conf.set("spark.sql.shuffle.partitions", "12")
sc = SparkContext.getOrCreate(conf=conf)
sc.setCheckpointDir("tmp_checkpoints/")
spark = SparkSession(sc)
settings = {
"link_type": "dedupe_only",
}
df = spark.read.csv("./tests/datasets/fake_1000_from_splink_demos.csv", header=True)
# df = df.withColumn("hi THERE", "email")
df = df.withColumnRenamed("surname", "hi THERE")
linker = SparkLinker(df, settings, input_table_aliases="fake_data_1")
br = "l.`first_name` = r.`first_name`"
linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions(br)
br = "l.first_name = r.`first_name` and l.`hi THERE` = r.`hi THERE`"
linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions(br)
def count_comparisons_from_blocking_rule_pre_filter_conditions_sqls(
    linker: "Linker", blocking_rule: Union[str, "BlockingRule"]
):
    if isinstance(blocking_rule, str):
FYI, you'll be able to use this new conversion function once some of my BR work has been merged in.
Co-authored-by: Tom Hepworth <45356472+ThomasHepworth@users.noreply.github.com>
if not join_conditions:
    if linker._two_dataset_link_only:
        sql = f"""
        SELECT
            (SELECT COUNT(*) FROM {input_tablename_l})
            *
            (SELECT COUNT(*) FROM {input_tablename_r})
                AS count_of_pairwise_comparisons_generated
        """
    else:
        sql = """
        select count(*) * count(*) as count_of_pairwise_comparisons_generated
        from __splink__df_concat
        """
This appears to be broken for "stacked" dataframes at present - i.e. a link job where we input a single dataframe with a column detailing which records belong to which dataset.
from splink.duckdb.linker import DuckDBLinker
import pandas as pd
df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
df['source_dataset'] = pd.Series(['a'] * 500 + ['b'] * 500, name='source_dataset')
linker = DuckDBLinker(
    df,
    settings_dict={
        "link_type": "link_only",
    },
)
linker._initialise_df_concat(True).as_pandas_dataframe()
linker._two_dataset_link_only # False
br = "levenshtein(first_name, 3)"
display(
linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions(br)
) # expecting 25k
Ouch, good spot, I will try and figure out how to fix.
Oh wait - I don't think this is actually broken. It's a little difficult to explain, but let me try:
- link_only works in the general n-dataset case, including n == 2
- In the general case, a self-join of the concatenated datasets is used to generate comparisons, filtering out records where source_dataset_l = source_dataset_r
- There is a specific optimisation where Splink observes two input datasets that allows an inner join of the raw (non-concatenated) input datasets
- This optimisation is not used if you pass Splink a single dataframe that implicitly has two source datasets
- 1,000,000 is therefore the correct result from your code example, because the optimisation is not being used
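To put illustrative numbers on this (assuming the 1,000-row input split 500/500 between datasets a and b, as in the snippet above):

# Illustrative arithmetic only, not Splink code.
n_total, n_a, n_b = 1000, 500, 500

# General n-dataset case: the pre-filter count comes from a self-join of
# the concatenated table, before the source_dataset_l <> source_dataset_r
# filter is applied.
print(f"{n_total * n_total:,}")  # 1,000,000 - the result returned here

# Two-dataset optimisation (not triggered for a stacked single dataframe):
# an inner join of the two raw, non-concatenated input frames.
print(f"{n_a * n_b:,}")  # 250,000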
Separately, there is something a bit weird here for which I will raise a different issue. Consider running the following on master:
from splink.duckdb.linker import DuckDBLinker
import pandas as pd
df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
df['source_dataset'] = pd.Series(['a'] * 500 + ['b'] * 500, name='source_dataset')
linker = DuckDBLinker(
    df,
    settings_dict={
        "link_type": "link_and_dedupe",
    },
)
import logging
logging.getLogger("splink").setLevel(1)
linker._initialise_df_concat_with_tf(True).as_pandas_dataframe()
You still end up with a column called __splink_source_dataset, which seems unnecessary.
I will raise an issue - I think this probably needs to be fixed separately
Yeah, it's definitely unnecessary.
The fix that I put in place was merely intended as a bandage before a more permanent solution was implemented.
Apologies, I should've made an issue at the time.
I'm essentially happy to approve this once the "stacked dataframe" problem has been resolved. That may require its own PR, because some of the logic may need to be rewritten. Apologies, I meant to clean up some of that code a while back, but never found the time.
Needed because sqlglot.optimizer.eliminate_joins.join_condition is not available prior to v7
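For context, here is a hedged sketch of the equi-join detection idea using sqlglot directly (written for this discussion; the function names are hypothetical and this is neither Splink's implementation nor the vendored join_condition helper): split the rule on AND and keep the equalities where one side references only l.* columns and the other only r.* columns.

import sqlglot
from sqlglot import expressions as exp

def _split_ands(node):
    # Flatten a chain of ANDs into its leaf conditions
    if isinstance(node, exp.And):
        yield from _split_ands(node.this)
        yield from _split_ands(node.expression)
    else:
        yield node

def detect_equi_join_conditions(blocking_rule: str):
    tree = sqlglot.parse_one(blocking_rule)
    equi_join_conditions, other_conditions = [], []
    for cond in _split_ands(tree):
        if isinstance(cond, exp.EQ):
            left_tables = {c.table for c in cond.this.find_all(exp.Column)}
            right_tables = {c.table for c in cond.expression.find_all(exp.Column)}
            # Keep equalities where one side refers only to l.* and the
            # other only to r.* (the flipped r/l ordering is ignored here
            # for brevity)
            if left_tables == {"l"} and right_tables == {"r"}:
                equi_join_conditions.append(cond.sql())
                continue
        other_conditions.append(cond.sql())
    return equi_join_conditions, other_conditions

# Splits into one equi-join condition and one residual (non-equi) condition
print(detect_equi_join_conditions(
    "l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 3"
))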
Thanks Robin, this is great!
Some caveats to be dealt with in a later PR.
Is your Pull Request linked to an existing Issue or Pull Request?
Will close #1376
Will also set the groundwork for a future method like linker.suggest_blocking_rules(target_comparisons=x).
Give a brief description for the solution you have provided
See issue #1376 for more detail.
To do:
- analyse_blocking.py
PR Checklist