[BUGFIX] Fix conditional for pyspark compatibility #8108
Conversation
LGTM, thanks!
great_expectations/core/batch.py (Outdated)

```diff
@@ -690,9 +690,9 @@ def head(self, *args, **kwargs) -> pd.DataFrame:
         return pd.DataFrame({})

-    if pyspark:
+    if pyspark_DataFrame:  # type: ignore[truthy-function]
```
@anthonyburdi This does not sound right -- per your own idea, I thought that we could actually make `pyspark.DataFrame` the way of passing the `DataFrame` (and I got to appreciate that idea!). Happy to discuss. Thanks.
Hi @alexsherstinsky, the issue with the code as-is was that the `if pyspark:` as written was checking the truthiness of the `compatibility/pyspark.py` module, not whether pyspark was installed. We could have changed it to `if pyspark.pyspark:`, but since only the dataframe is imported, we thought this might be cleaner.
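The bug described above can be demonstrated in a few lines. This is a hypothetical stand-in shim (the name `compat_pyspark` and the `None` sentinel are illustrative, not the actual great_expectations layout): a module object is always truthy once imported, so `if <module>:` cannot tell you whether the optional dependency behind it is installed.

```python
import types

# Hypothetical stand-in for a compatibility shim module; it imports
# fine even when the optional dependency (pyspark) is missing.
compat = types.ModuleType("compat_pyspark")

# Simulate the shim's guarded import when pyspark is NOT installed:
# the exported name falls back to a falsy placeholder (None here; the
# real code base may use a dedicated sentinel).
compat.DataFrame = None

# The buggy check: a module object is always truthy, so this branch
# would run even though pyspark is missing.
assert bool(compat) is True

# The fixed check: test the imported name itself, which is falsy
# when the optional import failed.
assert bool(compat.DataFrame) is False
```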
@anthonyburdi You got me to believe that using the pattern `if pyspark.pyspark:` or `if pyspark.DataFrame:` would work -- and it did! So I utilized it uniformly; the benefit was full scoping (no need to create special variables). I would vote for that approach, because if `pyspark.DataFrame` does not exist, `SPARK_NOT_IMPORTED` will produce the relevant error message. Thanks!
That makes sense @alexsherstinsky; the issue here, though, is that the `if` statement was effectively not checking for `SPARK_NOT_IMPORTED` but instead for the existence of the `pyspark` compatibility module. That is a real bug, and it is what this PR fixes.
@alexsherstinsky fixed, thanks for the discussion
@anthonyburdi I do not yet see why this change is needed. Happy to discuss. Thanks!
great_expectations/core/batch.py (Outdated)

```diff
@@ -690,9 +690,9 @@ def head(self, *args, **kwargs) -> pd.DataFrame:
         return pd.DataFrame({})

-    if pyspark:
+    if pyspark_DataFrame:  # type: ignore[truthy-function]
```
@anthonyburdi can you add a comment to this type ignore explaining why the `truthy-function` suppression is "okay"?
Thanks for the catch!
LGTM! Thank you!
…m/great-expectations/great_expectations into m/_/sqlalchemy2-pandas2-follow-up

* 'm/_/sqlalchemy2-pandas2-follow-up' of https://github.com/great-expectations/great_expectations:
  * [BUGFIX] Robust Handling Of Column Types And Empty DataFrames For DataBricks/Spark Environment (#8115)
  * [BUGFIX] respect result format bool only for validators and checkpoints (#8111)
  * [MAINTENANCE] Update build_in_memory_runtime_context to accept which datasources to include (#8017)
  * [BUGFIX] Import dataframe directly (#8108)
  * [MAINTENANCE] Cleanup generate_expectation_tests (#8019)
  * [MAINTENANCE] Ensure that new usage statistics schema changes are backwards compatible (#8109)
Looks like we were checking the compatibility module instead of the actual pyspark import. The issue is that the truthiness of the module does not change whether or not pyspark is installed, which is what the `if pyspark:` check was trying to test.
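A hedged sketch of the fixed pattern (the helper `head_sketch` and its return values are illustrative; only the shape of the conditional mirrors the PR): import the specific attribute with a falsy fallback, then branch on the imported name rather than on the module.

```python
try:
    # Import the one name we need; this fails cleanly when pyspark is absent.
    from pyspark.sql import DataFrame as pyspark_DataFrame
except ImportError:
    # Falsy fallback; the real code base may use a dedicated sentinel instead.
    pyspark_DataFrame = None

def head_sketch(data):
    """Illustrative helper: choose a code path based on pyspark availability."""
    # The corrected conditional tests the imported name, not the module,
    # so it is falsy whenever pyspark is not installed.
    if pyspark_DataFrame and isinstance(data, pyspark_DataFrame):
        return "spark path"
    return "pandas path"
```

Because `pyspark_DataFrame` short-circuits the `and`, the `isinstance` call is never reached when pyspark is missing, so the function degrades gracefully either way.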