
[MAINTENANCE] Update to MultiBatch Notebook to include Inferred - Sql #5958

Conversation

@Shinnnyshinshin Shinnnyshinshin (Contributor) commented Sep 8, 2022

Changes proposed in this pull request:

  • Follow-up to PR [MAINTENANCE] Certify InferredAssetSqlDataConnector and ConfiguredAssetSqlDataConnector #5847, which certified InferredAssetSqlDataConnector.
  • Adds a multi-batch example notebook for the SqlDataConnector (Postgres). The notebook takes tables corresponding to 2020 taxi data and does the following:
    • Adds a Datasource that contains one DataConnector with two assets, which allow the data to be loaded as a single Batch or split into multiple Batches.
    • Provides examples of BatchRequests that load the data as a single Batch or as multiple Batches (see the sketch after this list).
    • Shows how the resulting batch_list can be used with self-initializing Expectations to estimate parameters.
    • Shows how the resulting ExpectationSuite can be used to run a SimpleCheckpoint.
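
For illustration, here is a minimal sketch of the two kinds of BatchRequest described above; the datasource, data connector, and asset names are hypothetical stand-ins rather than the notebook's exact values.

import great_expectations as gx
from great_expectations.core.batch import BatchRequest

context = gx.get_context()

# Load the whole 2020 taxi table as a single Batch (hypothetical names).
single_batch_batch_request = BatchRequest(
    datasource_name="taxi_postgres_datasource",
    data_connector_name="inferred_data_connector_name",
    data_asset_name="yellow_tripdata_sample_2020",
)

# Load the same data as multiple Batches (for example, one per month) via a
# second asset on the same DataConnector that is configured with a splitter.
multi_batch_batch_request = BatchRequest(
    datasource_name="taxi_postgres_datasource",
    data_connector_name="inferred_data_connector_name",
    data_asset_name="yellow_tripdata_sample_2020_split_by_month",
)

batch_list = context.get_batch_list(batch_request=multi_batch_batch_request)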

Note

  • Added an Appendix to the end of the notebook, which describes the parameters and their function.

Definition of Done

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added unit tests where applicable and made sure that new and existing tests are passing.
  • I have run any local integration tests and made sure that nothing is broken.

netlify bot commented Sep 8, 2022

Deploy Preview for niobium-lead-7998 ready!

🔨 Latest commit: 86311a1
🔍 Latest deploy log: https://app.netlify.com/sites/niobium-lead-7998/deploys/631b739d9db7970009cc30d8
😎 Deploy Preview: https://deploy-preview-5958--niobium-lead-7998.netlify.app

@Shinnnyshinshin Shinnnyshinshin self-assigned this Sep 8, 2022
@Shinnnyshinshin Shinnnyshinshin requested a review from a team September 8, 2022 01:52

@anthonyburdi anthonyburdi (Member) left a comment

This is great! Thank you for clearly laying this out. I have some questions and suggested changes. Happy to discuss if helpful.

{
"data": {
"text/plain": [
"[<great_expectations.core.batch.Batch at 0x7f7d61cd5f40>,\n",

Why is this showing 12 batches if it is called a single_batch_batch_request?

"metadata": {},
"outputs": [],
"source": [
"validator.save_expectation_suite()"

Maybe we should look at the Expectation stored in the suite here (and see the stored params, noting that they come from the min/max of the median values across all batches)?
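
As a hedged illustration of that suggestion (variable names assumed from the notebook's context, not verbatim), the stored Expectation and its estimated parameters could be inspected like this:

# Inspect the Expectations saved to the suite and the parameters the
# profiler estimated from all Batches (e.g. min_value/max_value bounds).
suite = validator.get_expectation_suite(discard_failed_expectations=False)
for expectation in suite.expectations:
    print(expectation.expectation_type, expectation.kwargs)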

"id": "477381f5",
"metadata": {},
"source": [
"Now the ExpectationSuite can be used to validate single batches using a Checkpoint. In our example, let's validate a different table, `yellow_tripdata_sample_2020_02`, using the `ExpectationSuite` we built from `yellow_tripdata_sample_2020_01`."

Not sure where these tables are coming from - I thought we only had one table yellow_tripdata_sample_2020 and we are pulling different sets of batches from that?

"id": "477381f5",
"metadata": {},
"source": [
"Now the ExpectationSuite can be used to validate single batches using a Checkpoint. In our example, let's validate a different table, `yellow_tripdata_sample_2020_02`, using the `ExpectationSuite` we built from `yellow_tripdata_sample_2020_01`."

Suggested change
"Now the ExpectationSuite can be used to validate single batches using a Checkpoint. In our example, let's validate a different table, `yellow_tripdata_sample_2020_02`, using the `ExpectationSuite` we built from `yellow_tripdata_sample_2020_01`."
"Now the ExpectationSuite we built using all batches can be used to validate single batches using a Checkpoint. For example, we can run this checkpoint on new data when it comes in next month. In our example, let's validate a different table, `yellow_tripdata_sample_2020_02`, using the `ExpectationSuite` we built from `yellow_tripdata_sample_2020_01`."

{
"data": {
"text/plain": [
"False"

Shouldn't this be True if we created the expectation based on all of the batches?

"The signature of the `InferredAssetSqlDataConnector` also contains the following required parameters:\n",
"- `name`: string describing the name of this DataConnector.\n",
"- `datasource_name`: the name of the Datasource that contains it.\n",
"- `execution_engine`: an ExecutionEngine.\n",

Suggested change
"- `execution_engine`: an ExecutionEngine.\n",
"- `execution_engine`: The type of ExecutionEngine to use.\n",

Will Shin added 6 commits September 8, 2022 17:13
…i-batch-notebook

* develop:
  [Maintenance] Randomize the non-comprehensive tests (#5968)
  [RELEASE] 0.15.22 (#5973)
  [BUGFIX] Spark column.distinct_values no longer returns entire table distinct values (#5969)
  [MAINTENANCE] Use DataContext to ignore progress bars (#5959)
  [MAINTENANCE] Bump `Marshmallow` upper bound to work with Airflow operator (#5952)
  [MAINTENANCE] Add x-fails to flaky Cloud tests for purposes of 0.15.22 (#5964)
  [BUGFIX] Prevent "division by zero" errors in Rule-Based Profiler calculations when Batch has zero rows (#5960)
  [FEATURE] Improve slack error condition (#5818)
…ok' of https://github.com/great-expectations/great_expectations into m/GREAT-1226/GREAT-1228/inferred-sql-multi-batch-notebook

* 'm/GREAT-1226/GREAT-1228/inferred-sql-multi-batch-notebook' of https://github.com/great-expectations/great_expectations:
  [MAINTENANCE] Expectation suite new unit tests for add_citation (#5966)
  [MAINTENANCE] Expectation suite init unit tests + types (#5957)
  [MAINTENANCE] DatasourceStore refactoring (#5941)
@anthonyburdi anthonyburdi (Member) left a comment

LGTM! Just some tiny nits

"metadata": {},
"source": [
"### Typical Workflow\n",
"A `batch_list` becomes really useful when you are calculating parameters for auto-initializing Expectations, as they us a `RuleBasedProfiler` under-the-hood to calculate parameters."

Suggested change
"A `batch_list` becomes really useful when you are calculating parameters for auto-initializing Expectations, as they us a `RuleBasedProfiler` under-the-hood to calculate parameters."
"A `batch_list` becomes really useful when you are calculating parameters for auto-initializing Expectations, as they use a `RuleBasedProfiler` under-the-hood to calculate parameters."

"id": "f18d096f",
"metadata": {},
"source": [
"The observed value for our `yellow_tripdata_sample_2020` table where `trip_distance` is going to be `1.75`, which means the Expectation fails. We guessed wrong - but we can do better!\""

Suggested change
"The observed value for our `yellow_tripdata_sample_2020` table where `trip_distance` is going to be `1.75`, which means the Expectation fails. We guessed wrong - but we can do better!\""
"The observed value for our `yellow_tripdata_sample_2020` table where `trip_distance` is going to be `1.75`, which means the Expectation fails. We guessed wrong - but we can do better!"

"id": "477381f5",
"metadata": {},
"source": [
"Now the ExpectationSuite we built using all batches can be used to validate single batches using a Checkpoint. For example, we can run this checkpoint on new data when it comes in next month. In our example, let's validate a different table, `yellow_tripdata_sample_2020_02`, using the `ExpectationSuite` we built from `yellow_tripdata_sample_2020`."

Suggested change
"Now the ExpectationSuite we built using all batches can be used to validate single batches using a Checkpoint. For example, we can run this checkpoint on new data when it comes in next month. In our example, let's validate a different table, `yellow_tripdata_sample_2020_02`, using the `ExpectationSuite` we built from `yellow_tripdata_sample_2020`."
"Now the ExpectationSuite we built using all batches can be used to validate single batches using a Checkpoint. For example, we can run this checkpoint on new data when it comes in next month. In our example, let's validate a different batch from February 2020, using the `ExpectationSuite` we built from `yellow_tripdata_sample_2020`."

"The signature of the `InferredAssetSqlDataConnector` also contains the following required parameters:\n",
"- `name`: string describing the name of this DataConnector.\n",
"- `datasource_name`: the name of the Datasource that contains it.\n",
"- `execution_engine`: the type of ExecutionEngine to use\n",

Suggested change
"- `execution_engine`: the type of ExecutionEngine to use\n",
"- `execution_engine`: the type of ExecutionEngine to use.\n",

Comment on lines 1145 to 1146
"- `excluded_tables`: A list of tables to ignore when inferring data asset_names\n",
"- `included_tables`: If not `None`, only include tables in this list when inferring data asset_names\n",

Suggested change
"- `excluded_tables`: A list of tables to ignore when inferring data asset_names\n",
"- `included_tables`: If not `None`, only include tables in this list when inferring data asset_names\n",
"- `excluded_tables`: A list of tables to ignore when inferring data asset_names.\n",
"- `included_tables`: If not `None`, only include tables in this list when inferring data asset_names.\n",

"- `included_tables`: If not `None`, only include tables in this list when inferring data asset_names\n",
"- `skip_inapplicable_tables`: If `True`, tables that can't be successfully queried using sampling and splitter methods are excluded from inferred data_asset_names. If `False`, the class will throw an error during initialization if any such tables are encountered.\n",
"- `batch_spec_passthrough`: dictionary with keys that will be added directly to batch_spec.\n",
"- `introspection_directives`: Arguments passed to the introspection method to guide introspection\n",

Suggested change
"- `introspection_directives`: Arguments passed to the introspection method to guide introspection\n",
"- `introspection_directives`: Arguments passed to the introspection method to guide introspection.\n",

Will Shin added 3 commits September 9, 2022 09:45
…i-batch-notebook

* develop:
  [MAINTENANCE] Enhance unit tests for ExpectationSuite.isEquivalentTo (#5979)
@Shinnnyshinshin Shinnnyshinshin enabled auto-merge (squash) September 9, 2022 17:10
@Shinnnyshinshin Shinnnyshinshin merged commit 8f0eb08 into develop Sep 9, 2022
@Shinnnyshinshin Shinnnyshinshin deleted the m/GREAT-1226/GREAT-1228/inferred-sql-multi-batch-notebook branch September 9, 2022 17:21
Shinnnyshinshin pushed a commit that referenced this pull request Sep 9, 2022
* develop:
  [BUGFIX] Fix failing `run_profiler_notebook` test (#5983)
  [FEATURE] Refactor PartitionParameterBuilder into dedicated ValueCountsParameterBuilder and HistogramParameterBuilder (#5975)
  [MAINTENANCE] Add reverse assertion for isEquivalentTo tests (#5982)
  [MAINTENANCE] Update to MultiBatch Notebook to include Inferred - Sql  (#5958)
  [MAINTENANCE] Update to MultiBatch Notebook to include Configured - Sql (#5945)
  [MAINTENANCE] Remove unused fixtures from test suite (#5965)
  [MAINTENANCE] Enhance unit tests for ExpectationSuite.isEquivalentTo (#5979)
  [MAINTENANCE] Unit tests for `CheckpointStore` (#5967)
  [FEATURE] do not require expectation_suite_name in DataAssistantResult.show_expectations_by...() methods (#5976)
  [MAINTENANCE] Updated release schedule (#5977)
  [BUGFIX] Addresses issue with ExpectCompoundColumnsToBeUnique renderer (#5970)
  [MAINTENANCE] Expectation suite new unit tests for add_citation (#5966)
  [MAINTENANCE] Expectation suite init unit tests + types (#5957)
  [MAINTENANCE] DatasourceStore refactoring (#5941)