Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Allowing schema to be passed in as batch_spec_passthrough in Spark #5900

Conversation

Shinnnyshinshin
Copy link
Contributor

@Shinnnyshinshin Shinnnyshinshin commented Aug 31, 2022

Changes proposed in this pull request:

  • Follow-up to [MAINTENANCE] Adding serialization tests for Spark #5897
  • Schemas in Spark Dataframes are defined as StructType, which is not serializable.
  • This PR adds the schema.jsonValue() call which translates the object into a json, whenever a schema is passed in as a batch_spec_passthrough parameter
  • schema can be passed in at the Asset-level and DataConnector-level.
  • schema can also be passed in as part of BatchRequest that is part of a Checkpoint.
  • The PR adds the necessary changes to prepare_dump and convert_to_json_serializable() and adds integration tests to test_serialization.py.
  • Also includes changes from [FEATURE] SparkDFExecutionEngine able to use schema #5917

Definition of Done

Please delete options that are not relevant.

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added unit tests where applicable and made sure that new and existing tests are passing.
  • I have run any local integration tests and made sure that nothing is broken.

Thank you for submitting!

@netlify
Copy link

netlify bot commented Aug 31, 2022

Deploy Preview for niobium-lead-7998 ready!

Name Link
🔨 Latest commit e61e9e4
🔍 Latest deploy log https://app.netlify.com/sites/niobium-lead-7998/deploys/6311685ad1c448000807d676
😎 Deploy Preview https://deploy-preview-5900--niobium-lead-7998.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@Shinnnyshinshin Shinnnyshinshin self-assigned this Aug 31, 2022
@ghost
Copy link

ghost commented Aug 31, 2022

👇 Click on the image for a new way to code review
  • Make big changes easier — review code in small groups of related files

  • Know where to start — see the whole change at a glance

  • Take a code tour — explore the change with an interactive tour

  • Make comments and review — all fully sync’ed with github

    Try it now!

Review these changes using an interactive CodeSee Map

Legend

CodeSee Map Legend

Comment on lines +281 to +282
@pre_dump
def prepare_dump(self, data, **kwargs):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

batch_spec_passthrough can be at the Asset-level

Comment on lines +764 to +765
@pre_dump
def prepare_dump(self, data, **kwargs):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

batch_spec_passthrough can be at the DataConnector-level

@Shinnnyshinshin Shinnnyshinshin force-pushed the f/GREAT-465/GREAT-1204/adding-serialization-for-schema-in-spark branch from 67a0f5e to 22f5c01 Compare September 1, 2022 00:09
@Shinnnyshinshin Shinnnyshinshin changed the title [FEATURE] Serialization for schema in Spark [FEATURE] Allowing schema to be passed in as batch_spec_passthrough in Spark Sep 1, 2022
Comment on lines +713 to +716
@pytest.mark.parametrize(
"checkpoint_config,expected_serialized_checkpoint_config",
[
pytest.param(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same tests are now parameterized :)

Comment on lines -772 to -778
@pytest.mark.integration
def test_serialization_of_datasource_with_nested_objects_spark(spark_session):
datasource_config: DatasourceConfig = DatasourceConfig(
name="taxi_data",
class_name="Datasource",
module_name="great_expectations.datasource",
execution_engine=ExecutionEngineConfig(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same tests are now parameterized :)

Will Shin added 3 commits August 31, 2022 17:31
…in-spark' of https://github.com/great-expectations/great_expectations into f/GREAT-465/GREAT-1204/adding-serialization-for-schema-in-spark

* 'f/GREAT-465/GREAT-1204/adding-serialization-for-schema-in-spark' of https://github.com/great-expectations/great_expectations:
  [BUGFIX] Patch issue with `checkpoint_identifier` within `Checkpoint.run` workflow (#5894)
  [MAINTENANCE] Adding serialization tests for Spark (#5897)
  [MAINTENANCE] Add slow pytest marker to config and sort them alphabetically. (#5892)
This reverts commit 41dd255.
Copy link
Member

@anthonyburdi anthonyburdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! LGTM!

Copy link
Contributor

@NathanFarmer NathanFarmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it!

if schema and isinstance(schema, StructType):
data["batch_spec_passthrough"]["reader_options"][
"schema"
] = schema.jsonValue()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Comment on lines 945 to 953
"schema": StructType(
[
StructField(
"a", IntegerType(), True, None
),
StructField(
"b", IntegerType(), True, None
),
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is leveling up huge!!

William Shin added 2 commits September 1, 2022 08:43
…on-for-schema-in-spark

* develop:
  [MAINTENANCE] Remove xfails from passing tests in preparation for 0.15.21 release (#5908)
Will Shin added 2 commits September 1, 2022 10:50
…on-for-schema-in-spark

* develop:
  [DOCS] DOC-368 spelling correction (#5912)
@Shinnnyshinshin Shinnnyshinshin enabled auto-merge (squash) September 1, 2022 19:06
@Shinnnyshinshin Shinnnyshinshin enabled auto-merge (squash) September 2, 2022 02:20
@Shinnnyshinshin Shinnnyshinshin merged commit 82179a7 into develop Sep 2, 2022
@Shinnnyshinshin Shinnnyshinshin deleted the f/GREAT-465/GREAT-1204/adding-serialization-for-schema-in-spark branch September 2, 2022 03:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants