Smooth Validation and Profiling Performance #619

jcampbell · 2019-08-15T22:17:38Z

This PR addresses issues related to performance and ease of profiling larger datasets.

UDFs are removed from sparkdf_dataset where possible.
Logging is more terse in cases where we correctly identify a problem while evaluating cardinality.

In addition, this PR changes the default validation behavior to detect column-type expectations and group them. The grouping is nearly zero-cost in all circumstances and can speed up cases where evaluating consecutive expectations on the same column has a cache benefit. It also demonstrates how using a ValidationOperator may affect validation order or other characteristics.
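To make the grouping idea concrete, here is a minimal sketch in plain Python. It is not the code from this PR: `group_expectations_by_column` is a hypothetical helper, and it assumes expectations are dicts shaped like the ones in tests/test_data_asset.py (an `expectation_type` plus `kwargs` that may contain a `column`). Putting table-level expectations first is also just an assumption made for the example.

```python
def group_expectations_by_column(expectations):
    """Reorder expectations so those that share a column are adjacent.

    Hypothetical helper for illustration only; the real implementation
    in great_expectations may differ.
    """
    # Remember the order in which each column first appears, so the
    # columns themselves keep their original relative order.
    first_seen = {}
    for exp in expectations:
        column = exp.get("kwargs", {}).get("column")
        if column is not None and column not in first_seen:
            first_seen[column] = len(first_seen)

    def sort_key(exp):
        column = exp.get("kwargs", {}).get("column")
        if column is None:
            # Assumption: expectations without a column go first.
            return (0, 0)
        return (1, first_seen[column])

    # sorted() is stable, so expectations inside one group keep their order.
    return sorted(expectations, key=sort_key)


expectations = [
    {"expectation_type": "expect_column_values_to_be_in_set",
     "kwargs": {"column": "D", "value_set": ["e", "f", "g", "h"]}},
    {"expectation_type": "expect_table_row_count_to_equal",
     "kwargs": {"value": 4}},
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "A"}},
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "D"}},
]
ordered = group_expectations_by_column(expectations)
ordered_columns = [exp["kwargs"].get("column") for exp in ordered]
```

The sort is a single pass plus one stable sort over the expectation list, which is why the reordering itself should cost next to nothing compared with evaluating the expectations.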

coveralls · 2019-08-16T12:09:31Z

Coverage increased (+0.006%) to 81.437% when pulling f103be9 on feature/spark_optimization into 5852b5d on develop.

abegong

👍

abegong · 2019-08-16T23:41:47Z

great_expectations/dataset/sparkdf_dataset.py

-        return column.withColumn('__success', success_udf(column[0]))
+        if None in value_set:
+            # spark isin returns None when any value is compared to None
+            logger.error("expect_column_values_to_be_in_set cannot support a None in the value_set in spark")


such nice error messages
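For readers wondering why a None in the value_set is a problem, the following is a small pure-Python model of the usual SQL three-valued logic for IN, which is the behavior the comment in the diff describes. It does not call Spark; `sql_in` is a hypothetical function written only for illustration, with Python's None standing in for SQL NULL.

```python
def sql_in(value, value_set):
    """Model `value IN (value_set)` under SQL three-valued logic.

    Returns True, False, or None (None stands for SQL NULL / unknown).
    Illustrative only; this is not Spark's implementation.
    """
    # Comparing NULL to anything is unknown.
    if value is None:
        return None
    # A definite match on a non-NULL element is True.
    if any(v is not None and v == value for v in value_set):
        return True
    # No match, but a NULL element means the answer is unknown, not False.
    if any(v is None for v in value_set):
        return None
    return False


match_with_null = sql_in("a", ["a", None])
miss_with_null = sql_in("z", ["a", None])
miss_without_null = sql_in("z", ["a", "b"])
```

Under these semantics a row that is not in the set never evaluates to a clean False once the set contains None, so a success column built from the comparison would hold unknowns instead of failures. Logging an error up front, as the diff does, surfaces that early.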

abegong · 2019-08-16T23:43:59Z

tests/test_data_asset.py

            {'expectation_type': 'expect_column_values_to_be_in_set',
             'kwargs': {'column': 'D', 'value_set': ['e', 'f', 'g', 'h']}}
        ]

        sub1 = df[:3]

        sub1.discard_failing_expectations()
-        self.assertEqual(sub1.find_expectations(), exp1)
+        # PY2 sorting is allowed and order not guaranteed


Feels like sorting is a function that ExpectationSuite.get_config (or similar) should eventually handle.
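One way a test could compare expectation lists without depending on their order is sketched below. `canonical_order` is a hypothetical helper, not an existing great_expectations function, and it assumes each expectation is a JSON-serializable dict like the ones in the test above.

```python
import json


def canonical_order(expectations):
    """Return expectations sorted by a deterministic JSON rendering.

    Hypothetical helper: serializing with sort_keys=True gives every dict
    a stable string key, so the result does not depend on input order or
    on dict key ordering (which is not guaranteed on Python 2).
    """
    return sorted(expectations, key=lambda exp: json.dumps(exp, sort_keys=True))


found = [
    {"expectation_type": "expect_column_values_to_be_in_set",
     "kwargs": {"column": "D", "value_set": ["e", "f", "g", "h"]}},
    {"expectation_type": "expect_column_to_exist",
     "kwargs": {"column": "A"}},
]
expected = list(reversed(found))

same_after_sorting = canonical_order(found) == canonical_order(expected)
```

A test would then assert on the sorted lists rather than the raw ones. Moving the same normalization into the suite itself, as suggested above, would remove the need for each test to do it.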

jcampbell added 6 commits August 15, 2019 09:25

Use error logger instead of exception logger on profiler.

77fbea9

Add column-order validation to demonstrate validationoperator style action before refactor, with associated tests.

ab9637b

Catch minor typo

dd46704

Refactor set membership to not use UDF

2ca386c

Refactor regex expectation to avoid UDF

ca16fbe

Improve commenting for None in spark isin

23a0590

eugmandel previously approved these changes Aug 15, 2019

eugmandel and others added 3 commits August 15, 2019 16:34

Merge branch 'develop' into feature/spark_optimization

62b7c44

Merge branch 'develop' into feature/spark_optimization

b924762

Fix rlike import

6bcb411

jcampbell dismissed eugmandel’s stale review via 6bcb411 August 16, 2019 11:41

Make PY2 tests sort lists of expectations first

f103be9

jcampbell force-pushed the feature/spark_optimization branch from 66bff11 to f103be9 August 16, 2019 13:06

abegong approved these changes Aug 16, 2019

jcampbell merged commit 6787796 into develop Aug 16, 2019

jcampbell deleted the feature/spark_optimization branch August 16, 2019 23:46
