[MAINTENANCE] Fluent Datasources: Eliminate redundant Datasource name and DataAsset name from dictionary and JSON configuration #7573
Conversation
I thought we had agreed we weren't making this change yet.
Having said that, I broadly agree with the change and the way it's implemented.
I still think it would be better to do this in two steps: first, implement __getitem__/__setitem__/__iter__ on Datasource and use those to get/update and iterate over assets; then the second PR could purely be about the config representation changes.
I don't like that we are still special casing the yaml.
And I need to make sure this won't interfere with the fluent Cloud work.
Edit
Having spent more time thinking about this I don't understand why we are making this change.
I had initially thought it was because we were removing any of the serialization special casing for YAML. But we are still doing that; in fact, now the YAML special case deviates even more from the internal representation of the objects.
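The two-step suggestion above hinges on Datasource behaving like a mapping of assets. A toy sketch (class and attribute names here are illustrative, not the actual great_expectations API) of what __getitem__/__setitem__/__iter__ could look like:

```python
from dataclasses import dataclass
from typing import Dict, Iterator


@dataclass
class Asset:
    name: str


class Datasource:
    """Toy sketch of mapping-style dunders for asset access."""

    def __init__(self) -> None:
        self._assets: Dict[str, Asset] = {}

    def __getitem__(self, name: str) -> Asset:
        return self._assets[name]

    def __setitem__(self, name: str, asset: Asset) -> None:
        # Dict assignment makes name uniqueness automatic: assigning an
        # existing name replaces the old asset instead of duplicating it.
        self._assets[name] = asset

    def __iter__(self) -> Iterator[Asset]:
        return iter(self._assets.values())


ds = Datasource()
ds["my_csv"] = Asset("my_csv")
ds["my_csv"] = Asset("my_csv")  # replaces rather than duplicates
asset_names = [a.name for a in ds]
```

With these in place, get/update/iterate all go through one code path, and the config-representation change could then be reviewed separately.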
self.assets.append(asset)
# Since dictionary keys are unique, this ensures uniqueness of "_DataAssetT" objects.
self.assets = list(self.get_assets_as_dict().values())
I think we need to look before leaping here.
You are appending the asset before checking for a name conflict. If there is a conflict, it is too late: the asset has already been appended.
This is where __setitem__ (or another dedicated method) can be useful. It can be responsible for checking uniqueness and only set the item if it has a unique name.
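A minimal sketch of the look-before-leaping idea, using a hypothetical _add_asset helper (the class and method names are illustrative, not the actual API):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Asset:
    name: str


@dataclass
class Datasource:
    assets: List[Asset] = field(default_factory=list)

    def _add_asset(self, asset: Asset) -> Asset:
        # Look before leaping: reject a duplicate name *before* mutating
        # self.assets, so a failed add leaves the list untouched.
        if any(existing.name == asset.name for existing in self.assets):
            raise ValueError(f"An asset named {asset.name!r} already exists.")
        self.assets.append(asset)
        return asset


ds = Datasource()
ds._add_asset(Asset("my_csv"))
try:
    ds._add_asset(Asset("my_csv"))
    conflict_detected = False
except ValueError:
    conflict_detected = True
```

Because the check precedes the append, a rejected asset never appears in the list, and no after-the-fact dedup pass is needed.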
@Kilo59 I will look into it; my feeling is that we should borrow from the best of __setitem__, as you said, but not actually implement this "dunder" method.
@Kilo59 I made an update -- please review.
def get_assets_as_dict(self) -> MutableMapping[str, _DataAssetT]:
    """Returns available DataAsset objects as dictionary, with corresponding name as key.

    Returns:
        Dictionary of "_DataAssetT" objects with "name" attribute serving as key.
    """
    asset: _DataAssetT
    assets_as_dict: MutableMapping[str, _DataAssetT] = {
        asset.name: asset for asset in self.assets
    }

    return assets_as_dict
This seems like it should be a private method. If we need it at all.
@Kilo59 We are calling it from a DataContext subclass. With this fact, is your recommendation the same (change to private)? Or keep as public? Thanks.
The great_expectations.yml example hasn't changed.
So we are still special casing the yaml... why?
Our JSON and dicts now have an assets field that is a list of objects/dicts, but our yaml is not a list of objects.
So yaml.load() of our yaml looks like this:
{
"fluent_datasources": {
"my_pandas_abs_ds": {
"assets": {
"my_csv_asset_w_custom_connect_options": {
"connect_options": {
"abs_container": "test",
"abs_delimiter": "/",
"abs_name_starts_with": "",
},
"delimiter": ",",
"type": "csv",
},
"my_csv_asset_with_default_connect_options": {
"connect_options": {"abs_container": "this_is_always_required"},
"delimiter": ",",
"name": "my_csv_asset_with_default_connect_options",
"type": "csv",
},
},
"name": "my_pandas_abs_ds",
"type": "pandas_abs",
},
"my_pandas_fs_ds": {
"assets": {
"my_csv_asset_w_custom_connect_options": {
"connect_options": {"glob_directive": "**/*.csv"},
"sep": ",",
"type": "csv",
},
"my_csv_asset_with_default_connect_options": {
"sep": ",",
"type": "csv",
},
},
"base_directory": ".",
"name": "my_pandas_fs_ds",
"type": "pandas_filesystem",
},
}
}
But for json/dict it's like this...
{
"fluent_datasources": {
"my_pandas_abs_ds": {
"assets": [
{
"connect_options": {
"abs_container": "test",
"abs_delimiter": "/",
"abs_name_starts_with": "",
},
"delimiter": ",",
"name": "my_csv_asset_w_custom_connect_options",
"type": "csv",
},
{
"connect_options": {"abs_container": "this_is_always_required"},
"delimiter": ",",
"name": "my_csv_asset_with_default_connect_options",
"type": "csv",
},
],
"name": "my_pandas_abs_ds",
"type": "pandas_abs",
},
"my_pandas_fs_ds": {
"assets": [
{
"connect_options": {"glob_directive": "**/*.csv"},
"name": "my_csv_asset_w_custom_connect_options",
"sep": ",",
"type": "csv",
},
{
"name": "my_csv_asset_with_default_connect_options",
"sep": ",",
"type": "csv",
},
],
"base_directory": ".",
"name": "my_pandas_fs_ds",
"type": "pandas_filesystem",
},
}
}
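The two shapes differ only in where the asset name lives: it is the mapping key in the YAML form, and an explicit "name" entry in the JSON/dict form. A rough sketch of converting between them (helper names are illustrative, not from the codebase):

```python
from typing import Any, Dict, List


def assets_mapping_to_list(assets: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn the YAML-style {name: asset_config} mapping into the
    JSON/dict-style list of asset configs with an explicit "name" key."""
    return [{"name": name, **config} for name, config in assets.items()]


def assets_list_to_mapping(assets: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Inverse direction: pop the redundant "name" key back out to serve
    as the mapping key."""
    return {
        config["name"]: {k: v for k, v in config.items() if k != "name"}
        for config in assets
    }


yaml_style = {
    "my_csv_asset": {"type": "csv", "delimiter": ","},
}
json_style = assets_mapping_to_list(yaml_style)
round_tripped = assets_list_to_mapping(json_style)
```

Since the conversion is a lossless round trip, keeping both views is mechanically cheap; the debate here is about whether maintaining two shapes is worth the special casing.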
@Kilo59 One of the requirements is that YAML is special: YAML should read the way it has thus far, while JSON and Dict are more object-like. Thanks.
I think this warrants further discussion (as a team) before merging.
@Kilo59 We had the discussion about this at the Architecture Review meeting earlier today. While everybody understands the problem and the examples, the Team does not have a strong opinion on the format; so we agreed to add @donaldheppner as a reviewer, since @donaldheppner was an originator of the requirement for the YAML and JSON/Dictionary serialization formats to be different for product considerations. Thank you!
@@ -1629,6 +1629,7 @@ class DataContextConfigSchema(Schema):
    allow_none=True,
)
fluent_datasources = fields.Dict(
    keys=fields.Str(),
Why is this changing?
@Kilo59 Style -- nothing special here. Just indicating the keys are String-valued.
datasources_as_dict: Dict[str, Datasource] = self.get_datasources_as_dict()
datasources_as_dict.update(datasources)
# Since dictionary keys are unique, this ensures uniqueness of "Datasource" objects.
self.fluent_datasources = list(datasources_as_dict.values())
This is a clever way of finding the right datasources to update.
@Kilo59 I tried using set() first, but our Datasource and DataAsset structures are non-hashable, so this did not work (that is all right -- I do not think that we should rush to implement the __hash__ dunder, as using a dictionary works well). Thanks.
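An illustration of the dictionary-based dedup with unhashable objects (toy classes below, not the actual Datasource):

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass  # eq=True without frozen=True makes dataclass instances unhashable
class Datasource:
    name: str
    type: str = "pandas"


existing: List[Datasource] = [Datasource("a"), Datasource("b")]
updates: List[Datasource] = [Datasource("b", type="pandas_filesystem"), Datasource("c")]

# Keying by name gives set-like dedup without needing __hash__:
# a colliding name replaces the earlier entry, new names are appended.
merged: Dict[str, Datasource] = {ds.name: ds for ds in existing}
merged.update({ds.name: ds for ds in updates})
deduped: List[Datasource] = list(merged.values())
```

Because dicts preserve insertion order, the merged list keeps the original ordering while later updates win on name collisions, which is exactly the behavior a set would not guarantee.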
blocking this PR pending a full review
* develop: (96 commits)
  [DOCS] Updates the "Interactive Mode" guide for creating Expectations (great-expectations#7624)
  [MAINTENANCE] Utilize `NotImported` for SQLAlchemy, Google Cloud Services, Azure Blob Storage, and Spark import usage (great-expectations#7617)
  [BUGFIX] MapCondition Memory Inefficiencies in Spark (great-expectations#7626)
  [DOCS] Update overview.md (great-expectations#7627)
  [MAINTENANCE] Update `teams.yml` (great-expectations#7623)
  [BUGFIX] fix marshmallow schema for SQLAlchemy `connect_args` passthrough (great-expectations#7614)
  [RELEASE] 0.16.7 (great-expectations#7622)
  [DOCS] Corrects Heading Issue in How to host and share Data Docs on Azure Blob Storage (great-expectations#7620)
  [DOCS] Corrects Step Numbering in How to instantiate a specific Filesystem Data Context (great-expectations#7612)
  [BUGFIX] Remove spark from bic Expectations since it never worked for them (great-expectations#7619)
  [MAINTENANCE] Fluent Datasources: Eliminate redundant Datasource name and DataAsset name from dictionary and JSON configuration (great-expectations#7573)
  [DOCS] Add scripts under test for "How to create and edit Expectations with instant feedback from a sample Batch of data" (great-expectations#7615)
  [MAINTENANCE] Explicitly test relevant modules in Sqlalchemy compatibility pipeline (great-expectations#7613)
  [MAINTENANCE] Deprecate TableExpectation in favor of BatchExpectation (great-expectations#7610)
  [MAINTENANCE] Deprecate ColumnExpectation in favor of ColumnAggregateExpectation (great-expectations#7609)
  [DOCS] Correct expectation documentation for expect_column_max_to_be_between (great-expectations#7597)
  [BUGFIX] Misc gallery bugfixes (great-expectations#7611)
  [MAINTENANCE] SqlAlchemy 2 Compatibility - `engine.execute()` (great-expectations#7469)
  [BUGFIX] `dataset_name` made optional parameter for Expectations (great-expectations#7603)
  [FEATURE] Added AssumeRole Feature (great-expectations#7547)
  ...
Scope

To keep dictionary and JSON configuration representations consistent with the object representation in Fluent Datasources object structures, while also meeting the requirement that the YAML view of the configuration be human readable, the dictionary and JSON representations have contained a duplicated "name" value serving as the key to the object structures. This key is superfluous (and required additional error handling, let alone being logically redundant), because the object structures themselves already contain the "name" attribute and are thus self-contained. The present effort repairs this duplication. The changes are in the GxConfig and Datasource classes (with docstrings included).