Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MAINTENANCE] Fluent Datasources: Eliminate redundant Datasource name and DataAsset name from dictionary and JSON configuration #7573

Conversation

alexsherstinsky
Copy link
Contributor

@alexsherstinsky alexsherstinsky commented Apr 5, 2023

Scope

  • While YAML configuration representations are consistent with object representation in Fluent Datasources object structures and the requirements for YAML view of configuration to be human readable, dictionary and JSON representations have contained duplicated "name" value as the key to object structures. This key is superfluous (and required additional error handling, let alone being logically redundant), because the object structures themselves already contain the "name" attribute and are thus self-contained. The present effort repairs this duplication.
  • Additional convenience methods are implemented in GxConfig and Datasource classes (with docstrings included).

Please annotate your PR title to describe what the PR does, then give a brief bulleted description of your PR below. PR titles should begin with [BUGFIX], [FEATURE], [DOCS], or [MAINTENANCE]. If a new feature introduces breaking changes for the Great Expectations API or configuration files, please also add [BREAKING]. You can read about the tags in our contributor checklist.

Changes proposed in this pull request:

  • JIRA: GREAT-1709/GREAT-1755

After submitting your PR, CI checks will run and @cla-bot will check for your CLA signature.

For a PR with nontrivial changes, we review with both design-centric and code-centric lenses.

In a design review, we aim to ensure that the PR is consistent with our relationship to the open source community, with our software architecture and abstractions, and with our users' needs and expectations. That review often starts well before a PR, for example in github issues or slack, so please link to relevant conversations in notes below to help reviewers understand and approve your PR more quickly (e.g. closes #123).

Previous Design Review notes:

Definition of Done

Please delete options that are not relevant.

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added unit tests where applicable and made sure that new and existing tests are passing.
  • I have run any local integration tests and made sure that nothing is broken.

Thank you for submitting!

Alex Sherstinsky added 12 commits April 1, 2023 14:57
…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
@netlify
Copy link

netlify bot commented Apr 5, 2023

Deploy Preview for niobium-lead-7998 ready!

Name Link
🔨 Latest commit c22a5e6
🔍 Latest deploy log https://app.netlify.com/sites/niobium-lead-7998/deploys/6437252915187f0008997bec
😎 Deploy Preview https://deploy-preview-7573--niobium-lead-7998.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@ghost
Copy link

ghost commented Apr 5, 2023

👇 Click on the image for a new way to code review

Review these changes using an interactive CodeSee Map

Legend

CodeSee Map legend

Alex Sherstinsky added 3 commits April 5, 2023 14:46
…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
@alexsherstinsky alexsherstinsky marked this pull request as ready for review April 5, 2023 21:57
Copy link
Member

@Kilo59 Kilo59 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we had agreed we weren't making this change yet.

Having said that, I broadly agree with the change and the way it's implemented.

I still think it would be better to do this in two steps, first by implementing __get_item__/__set_item__,__iter__ on Datasource and use those to get/update and iterate over assets, and then the second PR could purely be about the config representation changes.

I don't like that we are still special casing the yaml.

And I need to make sure this won't interfere with the fluent Cloud work.

Edit

Having spent more time thinking about this I don't understand why we are making this change.
I had initially thought it was because we were removing any of the serialization special casing for YAML. But we are still doing that, in-fact now the YAML special case deviates even more from the internal representation of the objects.

Comment on lines 545 to 547
self.assets.append(asset)
# Since dictionary keys are unique, this ensures uniqueness of "_DataAssetT" objects.
self.assets = list(self.get_assets_as_dict().values())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to look before leaping here.
You are appending the asset before checking for a name conflict. If there is a conflict it's too late, the asset has already been appended.

This is where __set_item__ (or another dedicated method) can be useful. It can be responsible for checking your uniqueness and only set the item if it has a unique name.

Copy link
Contributor Author

@alexsherstinsky alexsherstinsky Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kilo59 I will look into it; my feeling is that we should borrow from the best of __set_item__ like you said, but not actually implement this "dunder" method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kilo59 I made an update -- please review.

Comment on lines +474 to +485
def get_assets_as_dict(self) -> MutableMapping[str, _DataAssetT]:
"""Returns available DataAsset objects as dictionary, with corresponding name as key.

Returns:
Dictionary of "_DataAssetT" objects with "name" attribute serving as key.
"""
asset: _DataAssetT
assets_as_dict: MutableMapping[str, _DataAssetT] = {
asset.name: asset for asset in self.assets
}

return assets_as_dict
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it should be a private method. If we need it at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kilo59 We are calling it from a DataContext subclass. With this fact, is your recommendation the same (change to private)? Or keep as public? Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The great_expectations.yml example hasn't changed.
So we are still special casing the yaml... why?
Our json and dicts now have an asset field that are lists of objects/dicts but our yaml is not a list of objects.
So yaml.load() of our yaml looks like this..

{
    "fluent_datasources": {
        "my_pandas_abs_ds": {
            "assets": {
                "my_csv_asset_w_custom_connect_options": {
                    "connect_options": {
                        "abs_container": "test",
                        "abs_delimiter": "/",
                        "abs_name_starts_with": "",
                    },
                    "delimiter": ",",
                    "type": "csv",
                },
                "my_csv_asset_with_default_connect_options": {
                    "connect_options": {"abs_container": "this_is_always_required"},
                    "delimiter": ",",
                    "name": "my_csv_asset_with_default_connect_options",
                    "type": "csv",
                },
            },
            "name": "my_pandas_abs_ds",
            "type": "pandas_abs",
        },
        "my_pandas_fs_ds": {
            "assets": {
                "my_csv_asset_w_custom_connect_options": {
                    "connect_options": {"glob_directive": "**/*.csv"},
                    "sep": ",",
                    "type": "csv",
                },
                "my_csv_asset_with_default_connect_options": {
                    "sep": ",",
                    "type": "csv",
                },
            },
            "base_directory": ".",
            "name": "my_pandas_fs_ds",
            "type": "pandas_filesystem",
        },
    }
}

But for json/dict it's like this...

{
    "fluent_datasources": {
        "my_pandas_abs_ds": {
            "assets": [
                {
                    "connect_options": {
                        "abs_container": "test",
                        "abs_delimiter": "/",
                        "abs_name_starts_with": "",
                    },
                    "delimiter": ",",
                    "name": "my_csv_asset_w_custom_connect_options",
                    "type": "csv",
                },
                {
                    "connect_options": {"abs_container": "this_is_always_required"},
                    "delimiter": ",",
                    "name": "my_csv_asset_with_default_connect_options",
                    "type": "csv",
                },
            ],
            "name": "my_pandas_abs_ds",
            "type": "pandas_abs",
        },
        "my_pandas_fs_ds": {
            "assets": [
                {
                    "connect_options": {"glob_directive": "**/*.csv"},
                    "name": "my_csv_asset_w_custom_connect_options",
                    "sep": ",",
                    "type": "csv",
                },
                {
                    "name": "my_csv_asset_with_default_connect_options",
                    "sep": ",",
                    "type": "csv",
                },
            ],
            "base_directory": ".",
            "name": "my_pandas_fs_ds",
            "type": "pandas_filesystem",
        },
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kilo59 One of the requirements is that YAML is special: YAML should read the way it has been thus far, while JSON and Dict are more object-like. Thanks.

Copy link
Member

@Kilo59 Kilo59 Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this warrants further discussion (as a team) before merging.

Copy link
Contributor Author

@alexsherstinsky alexsherstinsky Apr 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kilo59 We had the discussion about this at the Architecture Review meeting earlier today. While everybody understands the problem and the examples, the Team does not have a strong opinion on the format; so we agreed to add @donaldheppner as a reviewer, since @donaldheppner was an originator of the requirement for the YAML and JSON / Dictionary serialization formats to be different for product considerations. Thank you!

@@ -1629,6 +1629,7 @@ class DataContextConfigSchema(Schema):
allow_none=True,
)
fluent_datasources = fields.Dict(
keys=fields.Str(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this changing?

Copy link
Contributor Author

@alexsherstinsky alexsherstinsky Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kilo59 Style -- nothing special here. Just indicating the keys are String-valued.

Comment on lines 147 to 150
datasources_as_dict: Dict[str, Datasource] = self.get_datasources_as_dict()
datasources_as_dict.update(datasources)
# Since dictionary keys are unique, this ensures uniqueness of "Datasource" objects.
self.fluent_datasources = list(datasources_as_dict.values())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a clever way of finding the right datasources to update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kilo59 I tried using set() first, but our Datasource and DataAsset structures are non-hashable, so this did not work (that is all right -- I do not think that we should rush to implement dunder hash as using dictionary works well). Thanks.

Alex Sherstinsky added 4 commits April 6, 2023 08:55
…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
Alex Sherstinsky added 4 commits April 6, 2023 14:42
…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
Copy link
Member

@donaldheppner donaldheppner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking this PR pending a full review

…zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
…rstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37
@donaldheppner donaldheppner enabled auto-merge (squash) April 12, 2023 21:40
@donaldheppner donaldheppner merged commit 4aec193 into develop Apr 12, 2023
@donaldheppner donaldheppner deleted the maintenance/GREAT-1709/GREAT-1755/alexsherstinsky/zep/eliminate_redundant_datasource_name_and_data_asset_name_from_dict_and_json_configuration_elegantly-2023_03_13-37 branch April 12, 2023 22:26
Shinnnyshinshin pushed a commit to richardohara/great_expectations that referenced this pull request Apr 14, 2023
* develop: (96 commits)
  [DOCS] Updates the "Interactive Mode" guide for creating Expectations (great-expectations#7624)
  [MAINTENANCE] Utilize `NotImported` for SQLAlchemy, Google Cloud Services, Azure Blob Storage, and Spark import usage (great-expectations#7617)
  [BUGFIX] MapCondition Memory Inefficiencies in Spark (great-expectations#7626)
  [DOCS] Update overview.md (great-expectations#7627)
  [MAINTENANCE] Update `teams.yml` (great-expectations#7623)
  [BUGFIX] fix marshmallow schema for SQLAlchemy `connect_args` passthrough (great-expectations#7614)
  [RELEASE] 0.16.7 (great-expectations#7622)
  [DOCS] Corrects Heading Issue in How to host and share Data Docs on Azure Blob Storage (great-expectations#7620)
  [DOCS] Corrects Step Numbering in How to instantiate a specific Filesystem Data Context (great-expectations#7612)
  [BUGFIX] Remove spark from bic Expectations since it never worked for them (great-expectations#7619)
  [MAINTENANCE] Fluent Datasources: Eliminate redundant Datasource name and DataAsset name from dictionary and JSON configuration (great-expectations#7573)
  [DOCS] Add scripts under test for "How to create and edit Expectations with instant feedback from a sample Batch of data" (great-expectations#7615)
  [MAINTENANCE] Explicitly test relevant modules in Sqlalchemy compatibility pipeline (great-expectations#7613)
  [MAINTENANCE] Deprecate TableExpectation in favor of BatchExpectation (great-expectations#7610)
  [MAINTENANCE]  Deprecate ColumnExpectation in favor of ColumnAggregateExpectation (great-expectations#7609)
  [DOCS] Correct expectation documentation for expect_column_max_to_be_between (great-expectations#7597)
  [BUGFIX] Misc gallery bugfixes (great-expectations#7611)
  [MAINTENANCE] SqlAlchemy 2 Compatibility - `engine.execute()` (great-expectations#7469)
  [BUGFIX] `dataset_name` made optional parameter for Expectations (great-expectations#7603)
  [FEATURE] Added AssumeRole Feature (great-expectations#7547)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants