
[BUGFIX] Fix conditional for pyspark compatibility #8108

Merged: 9 commits into develop on Jun 14, 2023

Conversation

@anthonyburdi (Member) commented on Jun 13, 2023

Looks like we were checking the compatibility module instead of the actual pyspark import. The problem is that the truthiness of the module object does not change based on whether pyspark is installed, which is exactly what the if pyspark: check was trying to test.
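The failure mode can be sketched in a few lines: a Python module object is always truthy, so guarding on the compatibility module itself can never detect a missing install. The names below are illustrative stand-ins, not the actual great_expectations code, and the try/except binding is a common pattern for this fix; the PR's compatibility layer may differ in detail.

```python
import types

# Stand-in for a compatibility shim module: it imports successfully
# whether or not pyspark itself is installed on the machine.
compat_pyspark = types.ModuleType("compatibility.pyspark")

# A module object is always truthy, so this guard is a no-op check:
# the body runs even when pyspark is absent.
spark_available = False
if compat_pyspark:
    spark_available = True  # always reached

# The direction taken by the fix: bind the symbol you actually need
# at import time, and guard on that binding instead.
try:
    from pyspark.sql import DataFrame as pyspark_DataFrame
except ImportError:
    pyspark_DataFrame = None  # falsy when pyspark is missing
```

Guarding on pyspark_DataFrame is falsy only when the import actually failed, which is the condition the original if pyspark: was meant to express.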


@anthonyburdi anthonyburdi self-assigned this Jun 13, 2023
@anthonyburdi anthonyburdi changed the title [BUGFIX] import dataframe directly [BUGFIX] Import dataframe directly Jun 13, 2023

@anthonyburdi anthonyburdi enabled auto-merge (squash) June 13, 2023 19:00
@billdirks (Contributor) left a comment:

LGTM, thanks!

@@ -690,9 +690,9 @@ def head(self, *args, **kwargs) -> pd.DataFrame:
         return pd.DataFrame({})


-if pyspark:
+if pyspark_DataFrame:  # type: ignore[truthy-function]
@alexsherstinsky (Contributor) commented:

@anthonyburdi This does not sound right -- per your own idea, I thought we could actually make pyspark.DataFrame the way of passing the DataFrame (and I came to appreciate that idea!). Happy to discuss. Thanks.

@anthonyburdi (Member, Author) replied:

Hi @alexsherstinsky, the issue with the code as-is was that the if pyspark: check as written was testing the truthiness of the compatibility/pyspark.py module, not whether pyspark was installed. We could have changed it to if pyspark.pyspark:, but since only the DataFrame is imported, this seemed cleaner.

@alexsherstinsky (Contributor) replied:

@anthonyburdi You got me to believe that the pattern if pyspark.pyspark or if pyspark.DataFrame would work -- and it did! So I used it uniformly; the benefit was full scoping (no need to create special variables). I would vote for that approach, because if pyspark.DataFrame does not exist, the SPARK_NOT_IMPORTED sentinel will produce the relevant error message. Thanks!

@anthonyburdi (Member, Author) replied:

That makes sense, @alexsherstinsky. The issue here, though, is that the if statement was effectively not checking for SPARK_NOT_IMPORTED but for the existence of the pyspark compatibility module. That is a real bug, and it is what this PR fixes.

@anthonyburdi (Member, Author) replied:

@alexsherstinsky fixed, thanks for the discussion.

@alexsherstinsky (Contributor) left a comment:

@anthonyburdi I do not yet see why this change is needed. Happy to discuss. Thanks!

@@ -690,9 +690,9 @@ def head(self, *args, **kwargs) -> pd.DataFrame:
         return pd.DataFrame({})


-if pyspark:
+if pyspark_DataFrame:  # type: ignore[truthy-function]
@Kilo59 (Member) commented on Jun 13, 2023:

@anthonyburdi can you add a comment to this type ignore explaining why the truthy-function suppression is okay?

@anthonyburdi (Member, Author) replied:

Thanks for the catch!

@anthonyburdi anthonyburdi changed the title [BUGFIX] Import dataframe directly [BUGFIX] Fix conditional for pyspark compatibility Jun 14, 2023
@alexsherstinsky (Contributor) approved:

LGTM! Thank you!

@anthonyburdi anthonyburdi merged commit b7d2060 into develop Jun 14, 2023
43 checks passed
@anthonyburdi anthonyburdi deleted the b/_/fix_pyspark_import branch June 14, 2023 17:15
Shinnnyshinshin added a commit that referenced this pull request Jun 14, 2023
…m/great-expectations/great_expectations into m/_/sqlalchemy2-pandas2-follow-up

* 'm/_/sqlalchemy2-pandas2-follow-up' of https://github.com/great-expectations/great_expectations:
  [BUGFIX] Robust Handling Of Column Types And Empty DataFrames For DataBricks/Spark Environment (#8115)
  [BUGFIX] respect result format bool only for validators and checkpoints (#8111)
  [MAINTENANCE] Update build_in_memory_runtime_context to accept which datasources to include (#8017)
  [BUGFIX] Import dataframe directly (#8108)
  [MAINTENANCE] Cleanup generate_expectation_tests (#8019)
  [MAINTENANCE] Ensure that new usage statistics schema changes are backwards compatible (#8109)
4 participants