[MAINTENANCE] Performance improvement refactor for Spark unexpected values #3368

Merged
Changes from 6 commits
Commits
32 commits
ec72b07
[MAINTENANCE] Performance improvement refactor for helper _spark_colu…
Sep 8, 2021
5ad62bc
Merge branch 'develop' into working-branch/DEVREL-154/improve_SparkDF…
NathanFarmer Sep 8, 2021
778051e
Merge branch 'develop' into working-branch/DEVREL-154/improve_SparkDF…
NathanFarmer Sep 16, 2021
4473092
[MAINTENANCE] Performance improvement refactor for helper _spark_colu…
Sep 16, 2021
a784897
Merge branch 'working-branch/DEVREL-154/improve_SparkDFExecutionEngin…
Sep 16, 2021
66b5ae3
Change log
Sep 16, 2021
21eea41
[MAINTENANCE] This test no longer applies to spark because we stopped…
Sep 17, 2021
a39aa19
[MAINTENANCE] Remove all sorting logic from spark provider helpers (#…
Sep 17, 2021
abf47b8
[MAINTENANCE] Sort dictionaries in tests for comparisons (#3368).
Sep 17, 2021
a276eab
Merge branch 'develop' into working-branch/DEVREL-154/improve_SparkDF…
NathanFarmer Sep 17, 2021
fea4e71
Linting
Sep 17, 2021
bd8fc07
Merge branch 'working-branch/DEVREL-154/improve_SparkDFExecutionEngin…
Sep 17, 2021
8050ce9
Clean up
Sep 17, 2021
522a3f3
[MAINTENANCE] Incorrect source of __lt__ in comment (#3368).
Sep 17, 2021
fcb22b8
[MAINTENANCE] Clarify how sorting works for each data type (#3368).
Sep 17, 2021
dd56b24
[MAINTENANCE] Lambda instead of itemgetter for consistency/simplicity…
Sep 17, 2021
ce2105e
Linting
Sep 17, 2021
8b48961
Accidentally re-used variable name
Sep 17, 2021
2e2ff25
Linting
Sep 17, 2021
7a72aaa
[MAINTENANCE] Change final use of boolean_mapped_unexpected_values to…
Sep 21, 2021
19f3bd4
[MAINTENANCE] Helper function for sorting unexpected_values during te…
Sep 21, 2021
7adf3b9
[MAINTENANCE] When exact_match_out is True we still need to sort unex…
Sep 21, 2021
b9bb7cf
[MAINTENANCE] Moved sort logic into helper function (#3368).
Sep 21, 2021
099802a
Cleanup
Sep 21, 2021
1f9d8d3
[MAINTENANCE] Sort should also be applied to partial_unexpected_list …
Sep 21, 2021
93b5d25
[MAINTENANCE] Revert broken test back to its original state (#3368).
Sep 21, 2021
80074e9
Linting
Sep 21, 2021
e1ddfee
Merge branch 'develop' into working-branch/DEVREL-154/improve_SparkDF…
NathanFarmer Sep 21, 2021
341b769
[MAINTENANCE] Consolidate sorting to make it clear that we do it whet…
Sep 21, 2021
d338d03
Merge branch 'working-branch/DEVREL-154/improve_SparkDFExecutionEngin…
Sep 21, 2021
51f2005
Linting
Sep 21, 2021
753e44d
Merge branch 'develop' into working-branch/DEVREL-154/improve_SparkDF…
NathanFarmer Sep 21, 2021
1 change: 1 addition & 0 deletions docs_rtd/changelog.rst
@@ -9,6 +9,7 @@ develop
 -----------------
 * [FEATURE] Configurable multi-threaded checkpoint speedup (#3362)
 * [DOCS] "Deploying Great Expectations in a hosted environment without file system or CLI" (#3361)
+* [MAINTENANCE] Spark performance improvement for metrics that return unexpected values (#3368)
 
 0.13.33
 -----------------
15 changes: 4 additions & 11 deletions great_expectations/expectations/metrics/map_metric_provider.py
@@ -2373,17 +2373,10 @@ def _spark_column_map_condition_values(
             message=f'Error: The column "{column_name}" in BatchData does not exist.'
         )
 
-    data = (
-        df.withColumn("__row_number", F.row_number().over(Window.orderBy(F.lit(1))))
-        .withColumn("__unexpected", unexpected_condition)
-        .orderBy(F.col("__row_number"))
-    )
-
-    filtered = (
-        data.filter(F.col("__unexpected") == True)
-        .drop(F.col("__unexpected"))
-        .drop(F.col("__row_number"))
-    )
+    # withColumn is required to transform windowFunctions returned by some metrics to boolean mask
+    # e.g. increasing, decreasing, unique
+    data = df.withColumn("__unexpected", unexpected_condition)
+    filtered = data.filter(F.col("__unexpected") == True).drop(F.col("__unexpected"))
 
     result_format = metric_value_kwargs["result_format"]
     if result_format["result_format"] == "COMPLETE":
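
The hunk above drops the `__row_number` window and the trailing `orderBy`, and the commit messages mention removing sorting from the Spark helpers and sorting results in tests instead. This suggests the unexpected values can come back in any row order, so tests would need to compare them without regard to order. Below is a minimal sketch of such a comparison helper in plain Python. The name `sort_unexpected_values` and its behavior are assumptions for illustration only; it is not the helper added in this PR, and it assumes the values in one list are mutually comparable.

```python
from typing import Any, List


def sort_unexpected_values(values: List[Any]) -> List[Any]:
    """Return a sorted copy of a list of unexpected values.

    Hypothetical test helper (not the PR's implementation): it lets two
    result lists be compared without depending on the row order in which
    an execution engine happened to return them.
    """
    if not values:
        return []
    if isinstance(values[0], dict):
        # Dicts have no natural ordering, so sort each one by its
        # (key, value) pairs taken in key order.
        return sorted(values, key=lambda d: sorted(d.items()))
    # Scalars (numbers, strings, ...) use their built-in ordering.
    return sorted(values)


# Two engines return the same unexpected values in different row orders.
spark_like = [4, 2, 9]
pandas_like = [2, 4, 9]
assert sort_unexpected_values(spark_like) == sort_unexpected_values(pandas_like)
```

Sorting on both sides of the comparison keeps the ordering cost in the test suite rather than in the metric provider, which is the trade-off the diff appears to make.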