
Improve speed of link only sample test #1773

Merged: 9 commits from faster_link_only_sample into master, Nov 30, 2023
Conversation

@RobinL (Member) commented Nov 28, 2023

Now that we've fixed the analyse blocking test, this is the slowest test.

Combining a Spark test with debug_mode=True is very slow, because lots of intermediate tables are produced and output to parquet. (I've also tried setting the checkpointing method to persist, but that didn't help either.)

In this PR, I've disabled the Spark test and included a new test for the underlying calculation.
https://mojdt.slack.com/archives/C060KLFJXPX/p1699978472369589

Here are some previous notes copied from Slack:

I have been experimenting with this test (the second slowest).

The problem is I can’t find an easy way to make it faster:

  • It would be possible to refactor the test to make it faster, but at a significant loss of clarity as to what the test is actually doing

  • Reducing the size of the data makes the test close to useless; it needs a big sample because of the random element

  • Running debug_mode = True is really bad for Spark performance, but necessary to test the high-level API

  • Possibly there’s a solution with a seed, but I'm not sure how robust that is, and it only works in Spark and DuckDB
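The seed idea in the last bullet can be illustrated with a minimal sketch. This uses plain Python `random`, not the actual Spark/DuckDB sampling APIs the test exercises, and the function name is hypothetical:

```python
import random

# Sketch of the seed idea: with a fixed seed, a "random" sample is
# deterministic, so a test could assert on exact counts. This uses
# plain Python random, not the backend sampling the real test exercises.
def sample_row_ids(n_rows, fraction, seed):
    rng = random.Random(seed)
    return [i for i in range(n_rows) if rng.random() < fraction]

a = sample_row_ids(10_000, 0.2, seed=42)
b = sample_row_ids(10_000, 0.2, seed=42)
assert a == b  # identical across runs, unlike an unseeded sample
print(len(a))
```

The catch noted above still applies: only some backends expose a seed for their sampling, so this wouldn't fix the test for every dialect.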

I wonder if the best solution is simply to disable Spark for this test, on the basis that it’s really testing that the random sample size is computed correctly; if that’s right in DuckDB, then it should be right in Spark. (I realise that isn’t completely watertight, but if it’s wrong, other things should break too.)

@RobinL requested a review from ADBond on November 28, 2023 at 09:32
@@ -76,19 +96,12 @@ def estimate_u_values(linker: Linker, max_pairs, seed=None):
result = dataframe.as_record_dict()
dataframe.drop_table_from_database_and_remove_from_cache()
frame_counts = [res["count"] for res in result]
# total valid links is sum of pairwise product of individual row counts
@RobinL (Member, Author):
Moved this code to function so it could be tested individually
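A minimal sketch of the extracted calculation, using a hypothetical function name (the actual helper introduced in the PR may be named differently):

```python
from itertools import combinations

# Hypothetical name for the extracted helper. For link_type="link_only",
# records are only compared *between* input frames, never within one,
# so the total possible pairs is the sum of pairwise products of the
# individual frame row counts.
def total_link_only_pairs(frame_counts):
    return sum(a * b for a, b in combinations(frame_counts, 2))

# Two input frames of 3,000 rows each: 3,000 * 3,000 = 9,000,000 pairs
print(total_link_only_pairs([3000, 3000]))  # 9000000
```

Pulling this into a standalone function is what lets the calculation be unit-tested directly in DuckDB without the slow Spark round trip.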

linker = helper.Linker(
[df_l, df_r],
settings,
input_table_aliases=["_a", "_b"],
@RobinL (Member, Author):

this may make it slightly faster


# max_pairs is a good deal less than total possible pairs = 9_000_000
max_pairs = 1_800_000

settings = {
"link_type": "link_only",
"comparisons": [helper.cl.levenshtein_at_thresholds("name", 2)],
"comparisons": [helper.cl.exact_match("name")],
@RobinL (Member, Author) commented Nov 30, 2023:

no need for a CPU-heavy comparison to test this functionality
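As an illustration of the numbers in the diff above (the figures are taken from the comment and the diff, and the square-root sampling step is one plausible scheme, not necessarily Splink's exact internals):

```python
import math

# Two 3,000-row input frames under link_only give 9,000,000 possible
# pairs, so max_pairs=1_800_000 targets 20% of them.
total_possible_pairs = 3000 * 3000          # 9_000_000
max_pairs = 1_800_000
proportion = max_pairs / total_possible_pairs

# One plausible scheme: to get ~max_pairs out of a cross join, sample
# each side at the square root of the overall proportion.
per_frame_fraction = math.sqrt(proportion)

print(proportion)          # 0.2
print(per_frame_fraction)  # ~0.447
```

Because 20% of a 9M-pair space is still a large sample, the random element averages out well enough for the test's assertions to hold.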

@ADBond (Contributor) left a comment:

Yep, I think this looks sensible.
Happy with skipping that test for Spark, especially with the additional cover. I think we can always add further testing if we need to go in and change the relevant sampling code.

Comment on lines 30 to 31
helper.extra_linker_args()

@ADBond (Contributor):
is this line here by mistake?

@RobinL (Member, Author):
yes, good spot!

@RobinL merged commit aed409c into master on Nov 30, 2023
10 checks passed
@RobinL deleted the faster_link_only_sample branch on November 30, 2023 at 17:06