Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Add support to AWS Glue Data Catalog #5123

Merged
merged 66 commits into from
Oct 25, 2022

Conversation

lccasagrande
Copy link
Contributor

@lccasagrande lccasagrande commented May 15, 2022

Changes proposed in this pull request:

  • Add ConfiguredAssetGlueCatalogDataConnector to explicitly define database and tables from AWS Glue Data Catalog
  • Add InferredAssetGlueCatalogDataConnector to implicitly list database and tables from AWS Glue Data Catalog
  • Upg moto pkg to mock glue.get_databases() function
  • Upd SparkDFExecutionEngine to read data from the catalog (hive or glue), using the new GlueCatalogBatchSpec class
  • Add a new BatchSpec for data assets from GlueCatalogDataConnectors
  • Upd BaseDataContext to include GlueCatalogDataConnectors
  • Upd DataConnectorConfigSchema to include GlueCatalogDataConnectors fields

Previous Design Review notes:

Definition of Done

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added unit tests where applicable and made sure that new and existing tests are passing.

@netlify
Copy link

netlify bot commented May 15, 2022

👷 Deploy request for niobium-lead-7998 pending review.

Visit the deploys page to approve it

Name Link
🔨 Latest commit c7e9b44

@rdodev
Copy link
Contributor

rdodev commented May 25, 2022

Howdy @lccasagrande thanks a bunch for the contrib. I'll take a look and leave comments as necessary.

@lccasagrande
Copy link
Contributor Author

@rdodev is there anything more I can do in this PR? looking forward to be able to use it.

@rdodev
Copy link
Contributor

rdodev commented Jun 21, 2022

@rdodev is there anything more I can do in this PR? looking forward to be able to use it.

So very sorry @lccasagrande thanks for re-raising this PR. I will ask engineering to take a look.

Copy link
Contributor

@alexsherstinsky alexsherstinsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many failing tests; I would like to re-review this interesting contribution after the working branch is updated to the latest and the failing tests pass. Additional questions:

  1. Is this focused mainly on connecting the the AWS Glue Catalog as an SQL backend?
  2. Is it also possible to make the utilization of AWS Glue as a Spark execution engine seamless?

Thank you very much for this work and looking forward to updates. /cc @talagluck

@lccasagrande
Copy link
Contributor Author

I will rebase and re-run tests. Answering your questions:

Q) Is this focused mainly on connecting the the AWS Glue Catalog as an SQL backend?

Yes, it enables the discovery and usage of database tables cataloged in Glue Data Catalog without the need to use the InferredAssetS3DataConnector. Although it was not tested already, but it may allow the usage of this framework with database tables shared through the AWS Lake Formation, like in data mesh architectures.

Q) Is it also possible to make the utilization of AWS Glue as a Spark execution engine seamless? Yes it is. The Glue Data Catalog connector is the first step. The AWS Glue Execution engine will be the second one. Lets finish this PR first and, after that, we can discuss the AWS Glue Execution Engine. I did not want to make a PR with too much contributions, but it is possible and I already have an working sample that we developed as a plugin.

lccasagrande and others added 17 commits July 20, 2022 18:02
…ectations#5554)

* ipywidgets not required for dev since it is in requirements.txt
…ta_context_id` (great-expectations#5555)

* fix: use helper method for project config logic

* docs: add docstring

* test: write tests
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Chetan Kini <chetan@superconductive.com>
…s implemented `_init_datasource_store` (great-expectations#5543)

* feat: init commit

* feat: finish initial impl

* test: write integration tests

* chore: remove unused imports

* docs: add comments
@alexsherstinsky
Copy link
Contributor

@lccasagrande Thank you for your updates and replies. I did another round of review and left a few comments. I feel that at this point, we should be maintaining the working branch that passes all Azure checks so that once this pull request is approved (meaning only minor changes, if any at all would be remaining), then we can just merge it upon approval. Thanks!

@@ -7,7 +7,7 @@ Ipython>=7.16.3
ipywidgets>=7.5.1
jinja2>=2.10
jsonpatch>=1.22
jsonschema>=2.5.1
jsonschema>=2.5.1,<=4.7.2
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to set the max jsonschema pkg version because, with latest versions the error message from validate() is not the one that is being tested here:
assert "'variable_count' is a required property" in str(e.value).

I only got this problem with this test. I see two choices:

  1. Upd the message in the test assertion and do not set the max pkg version.
  2. Set the max pkg version and keep it as is.

I followed the second one.

@alexsherstinsky
Copy link
Contributor

@lccasagrande Thank you for all this great work! How about now changing the status of this pull request to be "regular" (as in "not Draft Pull Request" any more and updating the branch with the latest develop -- then you can set it on "auto-merge" so that once it is approved, it will track any future updates with the develop branch automatically and then get merged? Thank you very much!

@lccasagrande lccasagrande marked this pull request as ready for review October 24, 2022 16:38
Copy link
Contributor

@alexsherstinsky alexsherstinsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lccasagrande LGTM -- thank you so much for this very valuable contribution to Great Expectations and for your diligent efforts over the many weeks and months! /cc @kyleaton @talagluck @donaldheppner

@alexsherstinsky alexsherstinsky enabled auto-merge (squash) October 24, 2022 20:33
@alexsherstinsky alexsherstinsky merged commit 82ccff2 into great-expectations:develop Oct 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AWS Glue Catalog Data Connector Support
10 participants