[MAINTENANCE] Performance improvement refactor for Spark unexpected values #3368

Conversation

NathanFarmer
Contributor

@NathanFarmer NathanFarmer commented Sep 8, 2021

Changes proposed in this pull request:

  • Allow Spark unexpected values to return first available instead of ordered list
  • Performance tests on expect_column_values_to_not_be_null with 4.5 GB of local filesystem data showed a 40% reduction in total runtime
  • JIRA: DEVREL-154

Definition of Done

Please delete options that are not relevant.

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have run any local integration tests and made sure that nothing is broken.

…mn_map_condition_values where unexpected values no longer returns identical set of values every run.
@netlify

netlify bot commented Sep 8, 2021

✔️ Deploy Preview for niobium-lead-7998 ready!

🔨 Explore the source changes: 753e44d

🔍 Inspect the deploy log: https://app.netlify.com/sites/niobium-lead-7998/deploys/614a50b5928d5c00087a2a59

😎 Browse the preview: https://deploy-preview-3368--niobium-lead-7998.netlify.app

@github-actions
Contributor

github-actions bot commented Sep 8, 2021

HOWDY! This is your friendly 🤖 CHANGELOG bot 🤖

Please don't forget to add a clear and succinct description of your change under the Develop header in docs_rtd/changelog.rst, if applicable. This will help us with the release process. See the Contribution checklist in the Great Expectations documentation for the type of labels to use!

Thank you!

Nathan Farmer added 3 commits September 16, 2021 11:46
…mn_map_condition_values where unexpected values no longer returns identical set of values every run (#3368).
…e_performance' of github.com:great-expectations/great_expectations into working-branch/DEVREL-154/improve_SparkDFExecutionEngine_performance
@NathanFarmer NathanFarmer marked this pull request as ready for review September 16, 2021 17:01
Member

@anthonyburdi anthonyburdi left a comment

The code LGTM!

Member

@anthonyburdi anthonyburdi left a comment

Approved after further discussion.

@@ -213,7 +213,7 @@
},{
"title": "unexpected_values_exact_match_out_without_unexpected_index_list",
"exact_match_out" : true,
"suppress_test_for": ["pandas"],
"only_for": ["sqlalchemy"],
Contributor Author

"exact_match_out" : true will no longer work for Spark, since we are no longer sorting results.

Nathan Farmer added 2 commits September 17, 2021 14:41
…e_performance' of github.com:great-expectations/great_expectations into working-branch/DEVREL-154/improve_SparkDFExecutionEngine_performance
@@ -2429,7 +2425,6 @@ def _spark_column_map_condition_value_counts(

result_format = metric_value_kwargs["result_format"]

filtered = data.filter(F.col("__unexpected") == True).drop(F.col("__unexpected"))
Contributor

@NathanFarmer I believe that we can leave the filtered = statement where it was (an exception could be raised earlier, so there is no need to compute it before that point). Would you agree? Thanks.

Contributor Author

@alexsherstinsky Sure, we can leave this line where it is, but then I would advocate to move lines 2408-2409 down below as well for consistency.

Contributor

@NathanFarmer The way I am seeing it now seems good. Thank you.

@@ -2585,7 +2562,7 @@ def _spark_multicolumn_map_condition_values(
):
"""Return values from the specified domain that match the map-style metric in the metrics dictionary."""
(
boolean_mapped_unexpected_values,
unexpected_condition,
Contributor

@NathanFarmer Could we please keep this variable named boolean_mapped_unexpected_values as it was before? I was following the previous style in such methods across all execution engines and would like to keep it consistent, unless you have a strong reason for changing it now. Thanks!

Contributor Author

The reason for changing this was that metrics are not returning booleans in all cases. Cases that return a window function would fail for the 1-line solution filtered = df.filter(boolean_mapped_unexpected_values). The 2-line solution using withColumn creates a new boolean mapped column from this variable. You can see how I left it alone on line 2504 because it actually is a boolean mapping for all cases there.

Contributor

@NathanFarmer I cannot find all the line numbers involved (the UI here is confusing, but I follow your logic). Thank you! P.S.: Should we standardize all cases to use unexpected_condition? Or do we want to preserve a special spot and variable name for the situations where it will always be strictly-boolean mapped? Thoughts welcome. Thank you.

Contributor Author

@alexsherstinsky If we were to standardize all cases to use unexpected_condition that would future-proof metrics that return a window function. The tradeoff is that using boolean_mapped_unexpected_values with df.filter (1-line solution) is measurably faster where it is possible. I would propose that we leave boolean_mapped_unexpected_values as-is in places where it is not currently needed, and if a new metric requires it to be changed, it can be done at that time.

Contributor

@NathanFarmer Perhaps we need to clean up our code? Can you please help us out here -- I see in Pandas, SQL, and Spark the pattern that in some cases defines unexpected_condition and in other cases defines boolean_mapped_unexpected_values as variable names, but ultimately these variables are used the same way: in a WHERE type clause. So are we simply misnaming this variable in one case for spark, because it is truly an (unexpected) condition and not merely boolean-mapped (unexpected) values? If this is correct, then please go ahead and fix as you deem appropriate. Thanks so much!

Contributor Author

@alexsherstinsky It is correct that the metric always returns an unexpected_condition for Spark. So for semantic correctness, I have changed the one remaining use of Spark boolean_mapped_unexpected_values to unexpected_condition. In most Spark cases unexpected_condition can be passed directly to a WHERE clause, except in the case of a window function. I went ahead and changed:

    data = df.filter(unexpected_condition)

to:

    data = df.withColumn("__unexpected", unexpected_condition)
    filtered = data.filter(F.col("__unexpected") == True).drop(F.col("__unexpected"))

for consistency in this single case.

Contributor

@NathanFarmer Thanks -- at least it is consistent, and we can revise later. We now need to solve the remaining problem: the equivalency of results, when the sort order is not enforced. If we can do this, then it is a big gain!

@@ -1930,6 +1931,14 @@ def check_json_test_result(test, result, data_asset=None):
elif key == "unexpected_list":
# check if value can be sorted; if so, sort so arbitrary ordering of results does not cause failure
if (isinstance(value, list)) & (len(value) >= 1):
# dictionary handling isn't implemented in great_expectations.core.data_context_key.__lt__
Contributor

@NathanFarmer I ❤️ the idea here; I have two questions about it:

  1. Would it be better to follow up on your comment and update great_expectations.core.data_context_key.__lt__, or not?
  2. If we must do this "consistent sort" here, is there a strong justification for using itemgetter instead of just lambda? If at all possible, I would prefer the lambda (for simplicity).

Thank you! I love how this idea absolves us from the need for the computationally-prohibitive row_number()!

Contributor Author

I clarified the __lt__ comment (it is actually looking at the built-in Python class that is passed). I also changed itemgetter to a lambda.
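The "sort only if sortable" check discussed in this thread can be sketched in plain Python. This is an illustrative sketch, not the exact self_check/util.py code: the function name sort_if_sortable and the identity lambda key are assumptions for demonstration; the __lt__/NotImplemented probe mirrors the pattern in the diff.

```python
def sort_if_sortable(values):
    """Return a sorted copy of `values` when its elements support `<`,
    otherwise return the list unchanged (illustrative helper)."""
    if isinstance(values, list) and len(values) >= 1:
        # dict.__lt__ returns NotImplemented, so lists of dicts are
        # left unsorted rather than raising a TypeError inside sorted()
        if type(values[0].__lt__(values[0])) != type(NotImplemented):
            # a lambda key (rather than operator.itemgetter) keeps this simple
            return sorted(values, key=lambda v: v)
    return values

print(sort_if_sortable([3, 1, 2]))            # sortable: comes back ordered
print(sort_if_sortable([{"a": 1}, {"b": 2}]))  # dicts: returned as-is
```

Probing `values[0].__lt__(values[0])` avoids a try/except around sorted() and keeps arbitrary result ordering from failing equality checks only where a deterministic sort is actually possible.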

@@ -213,7 +213,7 @@
},{
"title": "unexpected_values_exact_match_out_without_unexpected_index_list",
"exact_match_out" : true,
"suppress_test_for": ["pandas"],
"only_for": ["sqlalchemy"],
Contributor

@NathanFarmer I feel that we must make it (or an equivalent, additional test) work for Spark. The existing test excludes Pandas, because unexpected_index_list only applies to Pandas; however, the functionality must work for the other execution engines. Since you introduced the "force-sorting" of both sides in the assertion in self_check/util.py, why would the existing test not work? To me, having this test is critical. Thank you!

Contributor Author

In self_check/util.py#L1882, when test["exact_match_out"] is True, all of the test sort logic is forgone. I cannot think of an analog to this test that would work.

Contributor Author

Note: There is a test above this one for Pandas. It is possible that sorting the data in the test harness would make it work for Spark, but I don't think even that is guaranteed.

Contributor

@NathanFarmer I feel that we should discuss this in order to find a solution for testing the functionality for the Spark engine. The reasoning is that the addition of row_number() and sorting on it was key to making Spark work. However, I believe that you are pointing out that the sort order of the results is the only difference, the output being the same as expected otherwise. If this is the case, then we need to figure out how to show that in a test. If I am not understanding this correctly, then we might have to disable the expectation for Spark if it is qualitatively wrong without the use of row_number() (and incurring the corresponding performance penalty). Thank you!

Contributor Author

@alexsherstinsky I was able to make changes to self_check/util.py to address this and reverted this test back to its original state. Please let me know your thoughts on the code, but the short version is we are now sorting unexpected_values and partial_unexpected_values in both cases where test["exact_match_out"] is True or False on Spark dataframes.
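The comparison strategy described above (sort both sides when the backend does not guarantee row order) can be sketched as follows. This is a hedged, minimal sketch; results_match is a hypothetical name, not the actual self_check/util.py function, and the real harness handles exact_match_out True and False plus nested result structures.

```python
def results_match(expected_unexpected, actual_unexpected):
    """Compare two unexpected-value lists while ignoring ordering.

    Spark may return the same set of unexpected values in a different
    order on every run, so both sides are sorted before comparison.
    """
    return sorted(expected_unexpected) == sorted(actual_unexpected)

# Spark can legally return the unexpected values in any order:
print(results_match(["a", "b", "c"], ["c", "a", "b"]))  # same values, reordered
print(results_match(["a", "b"], ["a", "x"]))            # genuinely different values
```

Sorting both the expected and the actual lists is what lets the exact-match tests pass without forcing a computationally expensive row_number() ordering inside the Spark execution engine itself.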

.drop(F.col("__unexpected"))
.drop(F.col("__row_number"))
)
filtered = df.filter(boolean_mapped_unexpected_values)
Contributor

@NathanFarmer Can we please preserve the pattern of having data = df.withColumn("__unexpected", boolean_mapped_unexpected_values) first, followed by the filter on F.col("__unexpected"), for consistency and readability purposes (plus potential extensibility needs)? Thank you!

Contributor Author

See my other comments about that pattern being measurably slower. We can discuss the tradeoffs in our face-to-face testing discussion.

@@ -1930,7 +1930,19 @@ def check_json_test_result(test, result, data_asset=None):
elif key == "unexpected_list":
# check if value can be sorted; if so, sort so arbitrary ordering of results does not cause failure
if (isinstance(value, list)) & (len(value) >= 1):
if type(value[0].__lt__(value[0])) != type(NotImplemented):
# __lt__ is not implemented for Python dictionaries, making sorting trickier
Contributor

❤️

Contributor

@alexsherstinsky alexsherstinsky left a comment

@NathanFarmer I added some more comments and would propose that you and I discuss. I think that there is only one open issue remaining, albeit an important one. Thank you!

@@ -1879,7 +1903,32 @@ def evaluate_json_test_cfe(validator, expectation_type, test):

def check_json_test_result(test, result, data_asset=None):
# Check results
if test["exact_match_out"] is True:
# For Spark we cannot guarantee the order in which values are returned, so we sort for testing purposes
Contributor

@NathanFarmer Naive thought: Why not sort for all backends? I do not see detriment in pre-sorting in a well-defined order these unexpected values for all backends indiscriminately. What do you think? Thanks!

@@ -213,7 +213,7 @@
},{
"title": "unexpected_values_exact_match_out_without_unexpected_index_list",
"exact_match_out" : true,
"only_for": ["sqlalchemy"],
"suppress_test_for": ["pandas"],
Contributor

@NathanFarmer Unfortunately, GitHub is still showing me the apparently incorrectly changed code, so reviewing the affected modules via screen sharing with the focus on what had to be changed would be great. Thanks!

Nathan Farmer and others added 4 commits September 21, 2021 16:37
…e_performance' of github.com:great-expectations/great_expectations into working-branch/DEVREL-154/improve_SparkDFExecutionEngine_performance
@NathanFarmer NathanFarmer enabled auto-merge (squash) September 21, 2021 22:42
Contributor

@alexsherstinsky alexsherstinsky left a comment

LGTM -- Thank you very much!

@NathanFarmer NathanFarmer merged commit f4ba4ed into develop Sep 21, 2021
@NathanFarmer NathanFarmer deleted the working-branch/DEVREL-154/improve_SparkDFExecutionEngine_performance branch September 21, 2021 22:53
@NathanFarmer NathanFarmer added the devrel This item is being addressed by the Developer Relations Team label Sep 24, 2021