
[MAINTENANCE] Adding serialization tests for Spark #5897

Conversation

@Shinnnyshinshin (Contributor) commented Aug 31, 2022

Changes proposed in this pull request:

  • Two tests added to test_serialization for serializing objects with SparkDFExecutionEngine:
  • one is a Checkpoint test
  • one is a Datasource test
  • Both are set up to allow the serialization to happen automatically with a Spark schema object.

After submitting your PR, CI checks will run and @cla-bot will check for your CLA signature.

For a PR with nontrivial changes, we review with both design-centric and code-centric lenses.

In a design review, we aim to ensure that the PR is consistent with our relationship to the open source community, with our software architecture and abstractions, and with our users' needs and expectations. That review often starts well before a PR, for example in GitHub issues or Slack, so please link to relevant conversations in the notes below to help reviewers understand and approve your PR more quickly (e.g. closes #123).

Previous Design Review notes:

Definition of Done

Please delete options that are not relevant.

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added unit tests where applicable and made sure that new and existing tests are passing
  • I have run any local integration tests and made sure that nothing is broken

Thank you for submitting!

@netlify
netlify bot commented Aug 31, 2022

Deploy Preview for niobium-lead-7998 ready!

🔨 Latest commit: e987124
🔍 Latest deploy log: https://app.netlify.com/sites/niobium-lead-7998/deploys/630fd29f293f30000876de12
😎 Deploy Preview: https://deploy-preview-5897--niobium-lead-7998.netlify.app


@@ -696,6 +704,133 @@ def test_checkpoint_config_and_nested_objects_are_serialized(
)


@pytest.mark.unit
def test_checkpoint_config_and_nested_objects_are_serialized_spark(spark_session):
Member commented:

By nature of having a Spark session here, I think this is an integration test

Contributor Author replied:

You're right. Updated.

)


def test_serialization_of_datasource_with_nested_objects_spark(spark_session):
Member commented:

Please annotate with mark 🙇🏽

Comment on lines +801 to +824
expected_serialized_datasource_config: dict = {
"data_connectors": {
"configured_asset_connector": {
"assets": {
"my_asset": {
"batch_spec_passthrough": {"reader_options": {"header": True}},
"class_name": "Asset",
"module_name": "great_expectations.datasource.data_connector.asset",
}
},
"class_name": "ConfiguredAssetFilesystemDataConnector",
"module_name": "great_expectations.datasource.data_connector.configured_asset_filesystem_data_connector",
}
},
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine.sparkdf_execution_engine",
},
"module_name": "great_expectations.datasource",
"class_name": "Datasource",
"name": "taxi_data",
}

observed_dump = datasourceConfigSchema.dump(obj=datasource_config)
Member commented:

I think we could take all of this and combine it into one test (test_serialization_something_about_spark) and parameterize the schema and expected return values.
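The parameterization suggested here could look roughly like the sketch below. This is illustrative only: the stub schema and config classes are hypothetical stand-ins for the real marshmallow schemas (checkpointConfigSchema, datasourceConfigSchema) and config objects, and an actual test would drive the cases through @pytest.mark.parametrize with the appropriate markers.

```python
# Hypothetical sketch of parameterizing the serialization tests.
# _StubSchema/_StubConfig stand in for the real marshmallow schemas
# and config objects used in Great Expectations' test suite.

class _StubSchema:
    """Minimal stand-in for a marshmallow schema exposing .dump()."""

    def __init__(self, fields):
        self._fields = fields

    def dump(self, obj):
        # Serialize only the declared fields, mirroring schema.dump(obj=...)
        return {name: getattr(obj, name) for name in self._fields}


class _StubConfig:
    """Minimal stand-in for a CheckpointConfig/DatasourceConfig object."""

    def __init__(self, **attrs):
        self.__dict__.update(attrs)


# (schema, config, expected) triples -- with pytest these would feed
# @pytest.mark.parametrize("schema, config, expected", SERIALIZATION_CASES)
SERIALIZATION_CASES = [
    (
        _StubSchema(["class_name", "name"]),
        _StubConfig(class_name="Datasource", name="taxi_data"),
        {"class_name": "Datasource", "name": "taxi_data"},
    ),
    (
        _StubSchema(["class_name"]),
        _StubConfig(class_name="Checkpoint"),
        {"class_name": "Checkpoint"},
    ),
]


def run_serialization_cases(cases):
    """Check each schema's dump output against its expected dict."""
    for schema, config, expected in cases:
        assert schema.dump(obj=config) == expected
    return len(cases)
```

In the real suite, each case would also carry its own pytest marker (e.g. an integration marker for cases that need a Spark session, per the earlier review comment).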

Contributor Author replied:

For this one I don't feel as comfortable, because we are testing two separate objects: CheckpointConfig and DatasourceConfig. The rest of the tests keep them separated.

Contributor Author added:

As a follow-up to this PR, I'm planning to add two more tests (one CheckpointConfig test and one DatasourceConfig test) that exercise the pre_dump() logic on the schema object. I think that would be a more appropriate way to parameterize the values... what do you think?

Member replied:

works for me! thank you 🙇🏽

…rialization-checkpoint-and-datasource-spark
@Shinnnyshinshin Shinnnyshinshin merged commit 28df031 into develop Aug 31, 2022
@Shinnnyshinshin Shinnnyshinshin deleted the m/GREAT-465/GREAT-1204/adding-test-for-serialization-checkpoint-and-datasource-spark branch August 31, 2022 22:45
Shinnnyshinshin pushed a commit that referenced this pull request Sep 1, 2022
…in-spark' of https://github.com/great-expectations/great_expectations into f/GREAT-465/GREAT-1204/adding-serialization-for-schema-in-spark

* 'f/GREAT-465/GREAT-1204/adding-serialization-for-schema-in-spark' of https://github.com/great-expectations/great_expectations:
  [BUGFIX] Patch issue with `checkpoint_identifier` within `Checkpoint.run` workflow (#5894)
  [MAINTENANCE] Adding serialization tests for Spark (#5897)
  [MAINTENANCE] Add slow pytest marker to config and sort them alphabetically. (#5892)