Improve speed of link only sample test #1773
Conversation
@@ -76,19 +96,12 @@ def estimate_u_values(linker: Linker, max_pairs, seed=None):
    result = dataframe.as_record_dict()
    dataframe.drop_table_from_database_and_remove_from_cache()
    frame_counts = [res["count"] for res in result]
    # total valid links is sum of pairwise product of individual row counts
Moved this code to a function so it could be tested individually.
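As a rough illustration of the kind of standalone helper that logic could be pulled out into (the name and exact signature here are hypothetical, not the PR's actual code):

```python
from itertools import combinations


def total_links_from_counts(frame_counts):
    """Total valid link_only pairs: the sum of pairwise products of row counts."""
    # With input frames of sizes n1, n2, n3, ..., a valid link_only pair must
    # draw its two records from different frames, so the total is
    # n1*n2 + n1*n3 + n2*n3 + ...
    return sum(a * b for a, b in combinations(frame_counts, 2))
```

For example, `total_links_from_counts([3, 4, 5])` gives 3*4 + 3*5 + 4*5 = 47, and a helper like this can be unit tested without touching a database backend.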
linker = helper.Linker(
    [df_l, df_r],
    settings,
    input_table_aliases=["_a", "_b"],
this may make it slightly faster
# max_pairs is a good deal less than total possible pairs = 9_000_000
max_pairs = 1_800_000
settings = {
    "link_type": "link_only",
-   "comparisons": [helper.cl.levenshtein_at_thresholds("name", 2)],
+   "comparisons": [helper.cl.exact_match("name")],
no need for a CPU-heavy comparison to test this functionality
Yep, think this looks sensible.
Happy with skipping that test for spark, especially with the additional cover - I think we can always add further testing if we need to go in and change the relevant sampling code
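For a rough sense of scale on the `max_pairs` value in the hunk above: assuming the two input frames each have 3,000 rows (an assumption on my part; the frame sizes aren't shown in this diff), the total possible link_only pairs would be 3,000 × 3,000 = 9,000,000, so `max_pairs = 1_800_000` samples roughly 20% of them:

```python
# Back-of-envelope check, assuming two 3,000-row input frames (not shown in the diff)
n_left, n_right = 3_000, 3_000
total_possible = n_left * n_right   # 9_000_000 candidate link_only pairs
max_pairs = 1_800_000
print(max_pairs / total_possible)   # 0.2 -> sampling roughly a fifth of the pairs
```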
tests/test_u_train.py
Outdated
helper.extra_linker_args()
is this line here by mistake?
yes, good spot!
Now that we've fixed the analyse blocking test, this is the slowest test.
Combining a `spark` test with `debug_mode=True` is very slow, because lots of intermediate tables are produced and output to parquet. (I've also tried setting the checkpointing method to `persist`, but that didn't help either.)

In this PR, I've disabled the spark test and included a new test for the underlying calculation.
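A sketch of what a unit test for the underlying calculation could look like, reusing the hypothetical `total_links_from_counts` helper sketched earlier (the actual test added in this PR may be named and structured differently):

```python
def test_total_links_from_counts():
    # two frames: every left row can pair with every right row
    assert total_links_from_counts([10, 7]) == 70
    # three frames: 10*7 + 10*3 + 7*3 = 121
    assert total_links_from_counts([10, 7, 3]) == 121
```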
Here are some previous notes copied from Slack (https://mojdt.slack.com/archives/C060KLFJXPX/p1699978472369589):
I have been experimenting with this test (the second slowest).
The problem is I can't find an easy way to make it faster.
It would be possible to refactor the test to make it faster, but at a significant loss of clarity as to what the test is actually doing
Reducing the size of the data makes the test close to useless; it needs a big sample because of the random element.
Running with `debug_mode = True` is really bad for Spark performance, but necessary to test the high-level API.
Possibly there's a solution with a seed, but I'm not sure how robust that is, and it only works in Spark and DuckDB.
I wonder if the best solution is simply to disable Spark for this test, on the basis that it's really testing that the random sample size is computed correctly, and if that's right in DuckDB, then it should be right in Spark. (I realise that isn't completely watertight, but if it's wrong, other things should break too.)
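One possible way to express "disable Spark for this test" in pytest, as a sketch; the parameter name and values here are illustrative rather than the suite's real fixtures:

```python
import pytest

# Hypothetical sketch: keep the DuckDB variant, skip only the Spark one.
@pytest.mark.parametrize(
    "backend",
    [
        "duckdb",
        pytest.param(
            "spark",
            marks=pytest.mark.skip(reason="debug_mode makes this far too slow on Spark"),
        ),
    ],
)
def test_link_only_u_sample(backend):
    # build a linker for `backend` and assert the sampled pair count is as expected
    ...
```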