[MAINTENANCE] Fluent Datasources: Eliminate redundant Datasource name and DataAsset name from dictionary and JSON configuration #7573

alexsherstinsky · 2023-04-05T21:35:44Z

Scope

While YAML configuration representations are consistent with object representation in Fluent Datasources object structures and the requirements for YAML view of configuration to be human readable, dictionary and JSON representations have contained duplicated "name" value as the key to object structures. This key is superfluous (and required additional error handling, let alone being logically redundant), because the object structures themselves already contain the "name" attribute and are thus self-contained. The present effort repairs this duplication.
Additional convenience methods are implemented in GxConfig and Datasource classes (with docstrings included).

Please annotate your PR title to describe what the PR does, then give a brief bulleted description of your PR below. PR titles should begin with [BUGFIX], [FEATURE], [DOCS], or [MAINTENANCE]. If a new feature introduces breaking changes for the Great Expectations API or configuration files, please also add [BREAKING]. You can read about the tags in our contributor checklist.

Changes proposed in this pull request:

JIRA: GREAT-1709/GREAT-1755

After submitting your PR, CI checks will run and @cla-bot will check for your CLA signature.

For a PR with nontrivial changes, we review with both design-centric and code-centric lenses.

In a design review, we aim to ensure that the PR is consistent with our relationship to the open source community, with our software architecture and abstractions, and with our users' needs and expectations. That review often starts well before a PR, for example in github issues or slack, so please link to relevant conversations in notes below to help reviewers understand and approve your PR more quickly (e.g. closes #123).

Previous Design Review notes:

Definition of Done

Please delete options that are not relevant.

My code follows the Great Expectations style guide
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added unit tests where applicable and made sure that new and existing tests are passing.
I have run any local integration tests and made sure that nothing is broken.

Thank you for submitting!

…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

…ion.

netlify · 2023-04-05T21:36:02Z

✅ Deploy Preview for niobium-lead-7998 ready!

Name	Link
🔨 Latest commit	`c22a5e6`
🔍 Latest deploy log	https://app.netlify.com/sites/niobium-lead-7998/deploys/6437252915187f0008997bec
😎 Deploy Preview	https://deploy-preview-7573--niobium-lead-7998.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

ghost · 2023-04-05T21:38:31Z

👇 Click on the image for a new way to code review

Legend

…ion.

…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

…ion.

…now.

…ion.

…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

Kilo59

I thought we had agreed we weren't making this change yet.

~~Having said that, I broadly agree with the change and the way it's implemented.~~

I still think it would be better to do this in two steps, first by implementing __get_item__/__set_item__,__iter__ on Datasource and use those to get/update and iterate over assets, and then the second PR could purely be about the config representation changes.

I don't like that we are still special casing the yaml.

And I need to make sure this won't interfere with the fluent Cloud work.

Edit

Having spent more time thinking about this I don't understand why we are making this change.
I had initially thought it was because we were removing any of the serialization special casing for YAML. But we are still doing that, in-fact now the YAML special case deviates even more from the internal representation of the objects.

Kilo59 · 2023-04-06T12:56:42Z

great_expectations/datasource/fluent/interfaces.py

+        self.assets.append(asset)
+        # Since dictionary keys are unique, this ensures uniqueness of "_DataAssetT" objects.
+        self.assets = list(self.get_assets_as_dict().values())


I think we need to look before leaping here.
You are appending the asset before checking for a name conflict. If there is a conflict it's too late, the asset has already been appended.

This is where __set_item__ (or another dedicated method) can be useful. It can be responsible for checking your uniqueness and only set the item if it has a unique name.

@Kilo59 I will look into it; my feeling is that we should borrow from the best of __set_item__ like you said, but not actually implement this "dunder" method.

@Kilo59 I made an update -- please review.

Kilo59 · 2023-04-06T13:03:40Z

great_expectations/datasource/fluent/interfaces.py

+    def get_assets_as_dict(self) -> MutableMapping[str, _DataAssetT]:
+        """Returns available DataAsset objects as dictionary, with corresponding name as key.
+
+        Returns:
+            Dictionary of "_DataAssetT" objects with "name" attribute serving as key.
+        """
+        asset: _DataAssetT
+        assets_as_dict: MutableMapping[str, _DataAssetT] = {
+            asset.name: asset for asset in self.assets
+        }
+
+        return assets_as_dict


This seems like it should be a private method. If we need it at all.

@Kilo59 We are calling it from a DataContext subclass. With this fact, is your recommendation the same (change to private)? Or keep as public? Thanks.

Kilo59 · 2023-04-06T13:12:34Z

tests/datasource/fluent/test_config.py

The great_expectations.yml example hasn't changed.
So we are still special casing the yaml... why?
Our json and dicts now have an asset field that are lists of objects/dicts but our yaml is not a list of objects.
So yaml.load() of our yaml looks like this..

{ "fluent_datasources": { "my_pandas_abs_ds": { "assets": { "my_csv_asset_w_custom_connect_options": { "connect_options": { "abs_container": "test", "abs_delimiter": "/", "abs_name_starts_with": "", }, "delimiter": ",", "type": "csv", }, "my_csv_asset_with_default_connect_options": { "connect_options": {"abs_container": "this_is_always_required"}, "delimiter": ",", "name": "my_csv_asset_with_default_connect_options", "type": "csv", }, }, "name": "my_pandas_abs_ds", "type": "pandas_abs", }, "my_pandas_fs_ds": { "assets": { "my_csv_asset_w_custom_connect_options": { "connect_options": {"glob_directive": "**/*.csv"}, "sep": ",", "type": "csv", }, "my_csv_asset_with_default_connect_options": { "sep": ",", "type": "csv", }, }, "base_directory": ".", "name": "my_pandas_fs_ds", "type": "pandas_filesystem", }, } }

But for json/dict it's like this...

{ "fluent_datasources": { "my_pandas_abs_ds": { "assets": [ { "connect_options": { "abs_container": "test", "abs_delimiter": "/", "abs_name_starts_with": "", }, "delimiter": ",", "name": "my_csv_asset_w_custom_connect_options", "type": "csv", }, { "connect_options": {"abs_container": "this_is_always_required"}, "delimiter": ",", "name": "my_csv_asset_with_default_connect_options", "type": "csv", }, ], "name": "my_pandas_abs_ds", "type": "pandas_abs", }, "my_pandas_fs_ds": { "assets": [ { "connect_options": {"glob_directive": "**/*.csv"}, "name": "my_csv_asset_w_custom_connect_options", "sep": ",", "type": "csv", }, { "name": "my_csv_asset_with_default_connect_options", "sep": ",", "type": "csv", }, ], "base_directory": ".", "name": "my_pandas_fs_ds", "type": "pandas_filesystem", }, } }

@Kilo59 One of the requirements is that YAML is special: YAML should read the way it has been thus far, while JSON and Dict are more object-like. Thanks.

I think this warrants further discussion (as a team) before merging.

@Kilo59 We had the discussion about this at the Architecture Review meeting earlier today. While everybody understands the problem and the examples, the Team does not have a strong opinion on the format; so we agreed to add @donaldheppner as a reviewer, since @donaldheppner was an originator of the requirement for the YAML and JSON / Dictionary serialization formats to be different for product considerations. Thank you!

Kilo59 · 2023-04-06T14:36:02Z

great_expectations/data_context/types/base.py

@@ -1629,6 +1629,7 @@ class DataContextConfigSchema(Schema):
        allow_none=True,
    )
    fluent_datasources = fields.Dict(
+        keys=fields.Str(),


Why is this changing?

@Kilo59 Style -- nothing special here. Just indicating the keys are String-valued.

Kilo59 · 2023-04-06T14:39:22Z

great_expectations/datasource/fluent/config.py

+        datasources_as_dict: Dict[str, Datasource] = self.get_datasources_as_dict()
+        datasources_as_dict.update(datasources)
+        # Since dictionary keys are unique, this ensures uniqueness of "Datasource" objects.
+        self.fluent_datasources = list(datasources_as_dict.values())


This is a clever way of finding the right datasources to update.

@Kilo59 I tried using set() first, but our Datasource and DataAsset structures are non-hashable, so this did not work (that is all right -- I do not think that we should rush to implement dunder hash as using dictionary works well). Thanks.

…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

donaldheppner

blocking this PR pending a full review

…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

* develop: (96 commits) [DOCS] Updates the "Interactive Mode" guide for creating Expectations (great-expectations#7624) [MAINTENANCE] Utilize `NotImported` for SQLAlchemy, Google Cloud Services, Azure Blob Storage, and Spark import usage (great-expectations#7617) [BUGFIX] MapCondition Memory Inefficiencies in Spark (great-expectations#7626) [DOCS] Update overview.md (great-expectations#7627) [MAINTENANCE] Update `teams.yml` (great-expectations#7623) [BUGFIX] fix marshmallow schema for SQLAlchemy `connect_args` passthrough (great-expectations#7614) [RELEASE] 0.16.7 (great-expectations#7622) [DOCS] Corrects Heading Issue in How to host and share Data Docs on Azure Blob Storage (great-expectations#7620) [DOCS] Corrects Step Numbering in How to instantiate a specific Filesystem Data Context (great-expectations#7612) [BUGFIX] Remove spark from bic Expectations since it never worked for them (great-expectations#7619) [MAINTENANCE] Fluent Datasources: Eliminate redundant Datasource name and DataAsset name from dictionary and JSON configuration (great-expectations#7573) [DOCS] Add scripts under test for "How to create and edit Expectations with instant feedback from a sample Batch of data" (great-expectations#7615) [MAINTENANCE] Explicitly test relevant modules in Sqlalchemy compatibility pipeline (great-expectations#7613) [MAINTENANCE] Deprecate TableExpectation in favor of BatchExpectation (great-expectations#7610) [MAINTENANCE] Deprecate ColumnExpectation in favor of ColumnAggregateExpectation (great-expectations#7609) [DOCS] Correct expectation documentation for expect_column_max_to_be_between (great-expectations#7597) [BUGFIX] Misc gallery bugfixes (great-expectations#7611) [MAINTENANCE] SqlAlchemy 2 Compatibility - `engine.execute()` (great-expectations#7469) [BUGFIX] `dataset_name` made optional parameter for Expectations (great-expectations#7603) [FEATURE] Added AssumeRole Feature (great-expectations#7547) ...

Alex Sherstinsky added 12 commits April 1, 2023 14:57

WIP

c9fdec6

Merge branch 'develop' into maintenance/GREAT-1709/GREAT-1755/alexshe…

675b823

…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

Merge branch 'develop' into maintenance/GREAT-1709/GREAT-1755/alexshe…

a878a34

…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

WIP

6f14fee

Merge branch 'develop' into maintenance/GREAT-1709/GREAT-1755/alexshe…

e57d318

…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

WIP

4623131

WIP

87550e9

WIP

03c12c3

WIP

bb589f8

Deduplicate Datasource and DataAsset name as key in object serializat…

3d572ba

…ion.

Deduplicate Datasource and DataAsset name as key in object serializat…

bcd032a

…ion.

merge

3e2f298

github-actions bot added core-team platform labels Apr 5, 2023

Alex Sherstinsky added 3 commits April 5, 2023 14:46

Deduplicate Datasource and DataAsset name as key in object serializat…

e6f6d7b

…ion.

Deduplicate Datasource and DataAsset name as key in object serializat…

286b012

…ion.

Merge branch 'develop' into maintenance/GREAT-1709/GREAT-1755/alexshe…

78c98a4

…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

alexsherstinsky marked this pull request as ready for review April 5, 2023 21:57

Alex Sherstinsky added 10 commits April 5, 2023 15:07

Deduplicate Datasource and DataAsset name as key in object serializat…

77ee9af

…ion.

Deduplicate Datasource and DataAsset name as key in object serializat…

55da83b

…ion.

We added a public API method, but it does not need the decorator for …

beed972

…now.

Deduplicate Datasource and DataAsset name as key in object serializat…

f9a67b5

…ion.

Deduplicate Datasource and DataAsset name as key in object serializat…

de0aeed

…ion.

Deduplicate Datasource and DataAsset name as key in object serializat…

871cd4d

…ion.

Deduplicate Datasource and DataAsset name as key in object serializat…

c51320a

…ion.

Merge branch 'develop' into maintenance/GREAT-1709/GREAT-1755/alexshe…

3407ae9

…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

JSON schemas

deccb13

Merge branch 'develop' into maintenance/GREAT-1709/GREAT-1755/alexshe…

7e59ac4

…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

Kilo59 reviewed Apr 6, 2023

View reviewed changes

Alex Sherstinsky added 4 commits April 6, 2023 08:55

Merge branch 'develop' into maintenance/GREAT-1709/GREAT-1755/alexshe…

dda09a9

…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

improve error handling

7681be5

Merge branch 'develop' into maintenance/GREAT-1709/GREAT-1755/alexshe…

aed0b2b

…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

Improve handling of reserved Pandas ephemeral data asset name.

18b7bcf

alexsherstinsky requested a review from Kilo59 April 6, 2023 19:12

Alex Sherstinsky added 4 commits April 6, 2023 14:42

Merge branch 'develop' into maintenance/GREAT-1709/GREAT-1755/alexshe…

2097b2f

…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

clean up -- EXPERIMENTAL -> Fluent

47dda46

merge

c06f938

API report adjustment

9b93db6

alexsherstinsky requested a review from donaldheppner April 7, 2023 17:37

alexsherstinsky enabled auto-merge (squash) April 7, 2023 17:38

github-actions bot added 9 commits April 7, 2023 21:10

Merge develop into maintenance/GREAT-1709/GREAT-1755/alexsherstinsky/…

b2f8321

…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

Merge develop into maintenance/GREAT-1709/GREAT-1755/alexsherstinsky/…

d0b89b6

…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

Merge develop into maintenance/GREAT-1709/GREAT-1755/alexsherstinsky/…

3803aa1

…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

Merge develop into maintenance/GREAT-1709/GREAT-1755/alexsherstinsky/…

d8602d5

…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

Merge develop into maintenance/GREAT-1709/GREAT-1755/alexsherstinsky/…

97fa31a

…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

Merge develop into maintenance/GREAT-1709/GREAT-1755/alexsherstinsky/…

bfa315b

…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

Merge develop into maintenance/GREAT-1709/GREAT-1755/alexsherstinsky/…

f9ecdc8

…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

Merge develop into maintenance/GREAT-1709/GREAT-1755/alexsherstinsky/…

94b6f23

…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

Merge develop into maintenance/GREAT-1709/GREAT-1755/alexsherstinsky/…

657ba62

…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

donaldheppner requested changes Apr 10, 2023

View reviewed changes

Merge develop into maintenance/GREAT-1709/GREAT-1755/alexsherstinsky/…

440fc7b

…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

donaldheppner disabled auto-merge April 10, 2023 22:20

donaldheppner approved these changes Apr 12, 2023

View reviewed changes

Merge branch 'develop' into maintenance/GREAT-1709/GREAT-1755/alexshe…

c22a5e6

…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37

donaldheppner enabled auto-merge (squash) April 12, 2023 21:40

donaldheppner merged commit 4aec193 into develop Apr 12, 2023

donaldheppner deleted the maintenance/GREAT-1709/GREAT-1755/alexsherstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37 branch April 12, 2023 22:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MAINTENANCE] Fluent Datasources: Eliminate redundant Datasource name and DataAsset name from dictionary and JSON configuration #7573

[MAINTENANCE] Fluent Datasources: Eliminate redundant Datasource name and DataAsset name from dictionary and JSON configuration #7573

alexsherstinsky commented Apr 5, 2023 •

edited

Loading

netlify bot commented Apr 5, 2023 •

edited

Loading

ghost commented Apr 5, 2023 •

edited by ghost

Loading

Legend

Kilo59 left a comment •

edited

Loading

Kilo59 Apr 6, 2023

alexsherstinsky Apr 6, 2023 •

edited

Loading

alexsherstinsky Apr 6, 2023

Kilo59 Apr 6, 2023

alexsherstinsky Apr 6, 2023

Kilo59 Apr 6, 2023

alexsherstinsky Apr 6, 2023

Kilo59 Apr 6, 2023 •

edited

Loading

alexsherstinsky Apr 7, 2023 •

edited

Loading

Kilo59 Apr 6, 2023

alexsherstinsky Apr 6, 2023 •

edited

Loading

Kilo59 Apr 6, 2023

alexsherstinsky Apr 6, 2023

donaldheppner left a comment

[MAINTENANCE] Fluent Datasources: Eliminate redundant Datasource name and DataAsset name from dictionary and JSON configuration #7573

[MAINTENANCE] Fluent Datasources: Eliminate redundant Datasource name and DataAsset name from dictionary and JSON configuration #7573

Conversation

alexsherstinsky commented Apr 5, 2023 • edited Loading

Scope

Previous Design Review notes:

Definition of Done

netlify bot commented Apr 5, 2023 • edited Loading

✅ Deploy Preview for niobium-lead-7998 ready!

ghost commented Apr 5, 2023 • edited by ghost Loading

Legend

Kilo59 left a comment • edited Loading

Choose a reason for hiding this comment

Edit

Choose a reason for hiding this comment

alexsherstinsky Apr 6, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Kilo59 Apr 6, 2023 • edited Loading

Choose a reason for hiding this comment

alexsherstinsky Apr 7, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexsherstinsky Apr 6, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

donaldheppner left a comment

Choose a reason for hiding this comment

alexsherstinsky commented Apr 5, 2023 •

edited

Loading

netlify bot commented Apr 5, 2023 •

edited

Loading

ghost commented Apr 5, 2023 •

edited by ghost

Loading

Kilo59 left a comment •

edited

Loading

alexsherstinsky Apr 6, 2023 •

edited

Loading

Kilo59 Apr 6, 2023 •

edited

Loading

alexsherstinsky Apr 7, 2023 •

edited

Loading

alexsherstinsky Apr 6, 2023 •

edited

Loading