Improve speed of link only sample test #1773
Conversation
@@ -76,19 +96,12 @@ def estimate_u_values(linker: Linker, max_pairs, seed=None):
    result = dataframe.as_record_dict()
    dataframe.drop_table_from_database_and_remove_from_cache()
    frame_counts = [res["count"] for res in result]
    # total valid links is sum of pairwise product of individual row counts
Moved this code to a function so it could be tested individually.
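As a rough illustration of the kind of standalone helper that logic could be pulled out into (the name and exact signature here are hypothetical, not the PR's actual code):

```python
from itertools import combinations


def total_links_from_counts(frame_counts):
    """Total valid link_only pairs: the sum of pairwise products of row counts."""
    # With input frames of sizes n1, n2, n3, ..., a valid link_only pair must
    # draw its two records from different frames, so the total is
    # n1*n2 + n1*n3 + n2*n3 + ...
    return sum(a * b for a, b in combinations(frame_counts, 2))
```

For example, `total_links_from_counts([3, 4, 5])` gives 3*4 + 3*5 + 4*5 = 47, and a helper like this can be unit tested without touching a database backend.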
linker = helper.Linker(
    [df_l, df_r],
    settings,
    input_table_aliases=["_a", "_b"],
this may make it slightly faster
# max_pairs is a good deal less than total possible pairs = 9_000_000
max_pairs = 1_800_000
settings = {
    "link_type": "link_only",
-   "comparisons": [helper.cl.levenshtein_at_thresholds("name", 2)],
+   "comparisons": [helper.cl.exact_match("name")],
no need for a CPU-heavy comparison to test this functionality
Yep, think this looks sensible.
Happy with skipping that test for spark, especially with the additional cover - I think we can always add further testing if we need to go in and change the relevant sampling code
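For a rough sense of scale on the `max_pairs` value in the hunk above: assuming the two input frames each have 3,000 rows (an assumption on my part; the frame sizes aren't shown in this diff), the total possible link_only pairs would be 3,000 × 3,000 = 9,000,000, so `max_pairs = 1_800_000` samples roughly 20% of them:

```python
# Back-of-envelope check, assuming two 3,000-row input frames (not shown in the diff)
n_left, n_right = 3_000, 3_000
total_possible = n_left * n_right   # 9_000_000 candidate link_only pairs
max_pairs = 1_800_000
print(max_pairs / total_possible)   # 0.2 -> sampling roughly a fifth of the pairs
```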
tests/test_u_train.py
Outdated
helper.extra_linker_args()
is this line here by mistake?
yes, good spot!
Now that we've fixed the analyse blocking test, this is the slowest test.
Combining a `spark` test with `debug_mode=True` is very slow, because lots of intermediate tables are produced and output to parquet. (I've also tried setting the checkpointing method to `persist`, but that didn't help either.)

In this PR, I've disabled the spark test and included a new test for the underlying calculation.
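A sketch of what a unit test for the underlying calculation could look like, reusing the hypothetical `total_links_from_counts` helper sketched earlier (the actual test added in this PR may be named and structured differently):

```python
def test_total_links_from_counts():
    # two frames: every left row can pair with every right row
    assert total_links_from_counts([10, 7]) == 70
    # three frames: 10*7 + 10*3 + 7*3 = 121
    assert total_links_from_counts([10, 7, 3]) == 121
```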
Here are some previous notes copied from Slack (https://mojdt.slack.com/archives/C060KLFJXPX/p1699978472369589):
I have been experimenting with this test (the second slowest).
The problem is I can't find an easy way to make it faster.
It would be possible to refactor the test to make it faster, but at a significant loss of clarity as to what the test is actually doing
Reducing the size of the data makes the test close to useless; it needs a big sample because of the random element.
Running with `debug_mode = True` is really bad for Spark performance, but necessary to test the high-level API.
Possibly there's a solution with a seed, but I'm not sure how robust that is, and it only works in Spark and DuckDB.
I wonder if the best solution is simply to disable Spark for this test, on the basis that it's really testing that the random sample size is computed correctly, and if that's right in DuckDB, then it should be right in Spark. (I realise that isn't completely watertight, but if it's wrong, other things should break too.)
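One possible way to express "disable Spark for this test" in pytest, as a sketch; the parameter name and values here are illustrative rather than the suite's real fixtures:

```python
import pytest

# Hypothetical sketch: keep the DuckDB variant, skip only the Spark one.
@pytest.mark.parametrize(
    "backend",
    [
        "duckdb",
        pytest.param(
            "spark",
            marks=pytest.mark.skip(reason="debug_mode makes this far too slow on Spark"),
        ),
    ],
)
def test_link_only_u_sample(backend):
    # build a linker for `backend` and assert the sampled pair count is as expected
    ...
```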