Resolve pipeline datasets matching patterns defined in the catalog #1491
Conversation
Signed-off-by: Pierre Godard <pierre.godard@protonmail.com>
```python
# Sort data sets by name, then by namespace to display similar data sets together
sorted_data_set_names = sorted(
    data_set_names, key=lambda name: ".".join(reversed(name.split(".")))
)
```
This is probably not needed!
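For context on what the reviewed sort key does: it reverses the dot-separated namespace segments so that datasets sharing the same final name group together. A quick illustration (the dataset names below are made up, not from the PR):

```python
# Hypothetical namespaced dataset names, for illustration only
names = ["uk.companies", "de.companies", "uk.reviews"]

# Reversing the segments sorts by the final name first, then by namespace,
# so the two "companies" datasets end up adjacent
sorted_names = sorted(names, key=lambda name: ".".join(reversed(name.split("."))))
print(sorted_names)
```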
```python
for data_set_name in sorted_data_set_names:
    try:
        catalog._get_dataset(data_set_name)  # pylint: disable=protected-access
    except DatasetNotFoundError:
        continue
```
Suggested change:

```diff
-for data_set_name in sorted_data_set_names:
-    try:
-        catalog._get_dataset(data_set_name)  # pylint: disable=protected-access
-    except DatasetNotFoundError:
-        continue
+for dataset in all_datasets:
+    catalog.exists(dataset)
```
This internally calls `catalog._get_dataset(dataset)` and registers the dataset with the catalog, so subsequent `catalog.list()` calls will include the factory datasets too.
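The registration side effect described above can be sketched with a toy catalog. Everything here (`ToyCatalog`, the naive regex pattern matching) is a made-up stand-in for illustration, not the real Kedro `DataCatalog` implementation:

```python
import re


class ToyCatalog:
    """Toy stand-in for a data catalog with factory patterns (illustration only)."""

    def __init__(self, entries, patterns):
        self._entries = dict(entries)  # explicitly declared datasets
        self._patterns = patterns      # factory patterns like "{name}_metrics"

    def _get_dataset(self, name):
        if name in self._entries:
            return self._entries[name]
        for pattern in self._patterns:
            # Crude pattern matching: turn "{name}" placeholders into regex groups
            regex = "^" + re.sub(r"\{\w+\}", r"(\\w+)", pattern) + "$"
            if re.match(regex, name):
                # Materialise and register the dataset, mimicking the side effect
                self._entries[name] = f"resolved from {pattern}"
                return self._entries[name]
        raise KeyError(name)

    def exists(self, name):
        try:
            self._get_dataset(name)
            return True
        except KeyError:
            return False

    def list(self):
        return sorted(self._entries)


catalog = ToyCatalog({"companies": "csv"}, ["{name}_metrics"])
catalog.exists("train_metrics")  # resolves the pattern and registers the dataset
print(catalog.list())            # now includes "train_metrics"
```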
```python
# Sort data sets by name, then by namespace to display similar data sets together
sorted_data_set_names = sorted(
    data_set_names, key=lambda name: ".".join(reversed(name.split(".")))
)
```
Suggested change:

```diff
-# Sort data sets by name, then by namespace to display similar data sets together
-sorted_data_set_names = sorted(
-    data_set_names, key=lambda name: ".".join(reversed(name.split(".")))
-)
```
```python
data_set_names = {
    data_set_name
    for pipeline in pipelines.values()
    for data_set_name in pipeline.data_sets()
}
```
Also just flagging to the team that this method is renamed to `pipeline.datasets()` on the Kedro `develop` branch.
Hi @pierre-godard, thank you for the PR. Will it be possible for you to discard the files being renamed from the PR and make the changes suggested by @ankatiyar? Also, we access all our Kedro-Viz data via the DataAccessManager. Can you please shift the `resolve_dataset_factory_patterns` method from `server.py` to the `DataAccessManager`? For your reference, based on the suggestions:

```python
def resolve_dataset_factory_patterns(
    self, catalog: DataCatalog, pipelines: Dict[str, KedroPipeline]
):
    """Resolve dataset factory patterns in data catalog by matching
    them against the datasets in the pipelines.
    """
    for pipeline in pipelines.values():
        if hasattr(pipeline, "data_sets"):
            datasets = pipeline.data_sets()
        else:
            datasets = pipeline.datasets()

        for dataset_name in datasets:
            catalog.exists(dataset_name)
```

Thank you
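The `hasattr` fallback in the snippet above is a common way to stay compatible across a method rename. A minimal, self-contained sketch of the pattern (the two toy pipeline classes are made up for illustration, not the real Kedro classes):

```python
class OldStylePipeline:
    """Toy stand-in for a pipeline before the rename (not the real Kedro class)."""

    def data_sets(self):
        return {"companies", "model_input"}


class NewStylePipeline:
    """Toy stand-in for a pipeline after the rename to .datasets()."""

    def datasets(self):
        return {"companies", "model_input"}


def pipeline_datasets(pipeline):
    # Mirror the hasattr check from the suggested method: prefer the old
    # method name if present, otherwise fall back to the new one.
    if hasattr(pipeline, "data_sets"):
        return pipeline.data_sets()
    return pipeline.datasets()


# Both pipeline styles yield the same dataset names through one code path
print(pipeline_datasets(OldStylePipeline()) == pipeline_datasets(NewStylePipeline()))
```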
Closing this in favor of #1588
Description
When a tracking dataset is defined in the pipeline outputs but not directly in the data catalog, and it matches a dataset pattern defined in the data catalog, then this dataset won't be detected by Kedro.
This PR fixes this behavior by adding a dataset resolution step after the catalog and pipelines have been loaded.
Development notes
The `demo-project` has been updated to use dataset patterns in the tracking layer. The term "resolve" is used because this is similar to the resolve command mentioned here: Add new catalog CLI commands for dataset factories kedro#2603
QA notes

Checklist

- Updated the `RELEASE.md` file