Resolve pipeline datasets matching patterns defined in the catalog #1491

Conversation

@pierre-godard commented Aug 16, 2023

Description

When a tracking dataset is defined in the pipeline outputs but not directly in the data catalog, and it matches a dataset factory pattern defined in the catalog, this dataset is not detected by Kedro-Viz.

This PR fixes this behavior by adding a dataset resolution step after the catalog and pipelines have been loaded.
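For illustration, here is a minimal sketch of what such a resolution step could look like (the function name and import paths below are placeholders for this example rather than the exact code in this PR):

from typing import Dict

from kedro.io import DataCatalog
from kedro.io.core import DatasetNotFoundError  # DataSetNotFoundError on older Kedro releases
from kedro.pipeline import Pipeline


def resolve_pipeline_dataset_patterns(
    catalog: DataCatalog, pipelines: Dict[str, Pipeline]
) -> None:
    """Register catalog entries for pipeline datasets that are not declared
    explicitly but match a dataset factory pattern in the catalog."""
    for pipeline in pipelines.values():
        for dataset_name in pipeline.data_sets():  # renamed to datasets() on newer Kedro
            try:
                # Resolving the name registers the dataset if it matches a pattern.
                catalog._get_dataset(dataset_name)  # pylint: disable=protected-access
            except DatasetNotFoundError:
                continue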

Development notes

QA notes

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added new entries to the RELEASE.md file
  • Added tests to cover my changes

Pierre Godard added 4 commits August 16, 2023 11:27
Pierre Godard and others added 4 commits August 16, 2023 13:55
Comment on lines +48 to +51
# Sort data sets by name, then by namespace to display similar data sets together
sorted_data_set_names = sorted(
    data_set_names, key=lambda name: ".".join(reversed(name.split(".")))
)
Contributor:
This is probably not needed!

Comment on lines +53 to +57
for data_set_name in sorted_data_set_names:
    try:
        catalog._get_dataset(data_set_name)  # pylint: disable=protected-access
    except DatasetNotFoundError:
        continue
Contributor:
Suggested change
for data_set_name in sorted_data_set_names:
    try:
        catalog._get_dataset(data_set_name)  # pylint: disable=protected-access
    except DatasetNotFoundError:
        continue
for dataset in all_datasets:
    catalog.exists(dataset)

Contributor:
This internally calls catalog._get_dataset(dataset) and registers it to the catalog so subsequent catalog.list() calls will have the factory datasets too
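As a rough sketch of that behaviour (the catalog entries below are made up, and this assumes a Kedro version with dataset factory support, roughly 0.18.12 or newer):

from kedro.io import DataCatalog

# A catalog defined only by a dataset factory pattern, with no concrete entries.
catalog = DataCatalog.from_config(
    {
        "{name}_data": {
            "type": "pandas.CSVDataSet",  # needs kedro-datasets installed
            "filepath": "data/01_raw/{name}.csv",
        }
    }
)

print(catalog.list())             # []: the pattern alone is not a registered dataset
catalog.exists("companies_data")  # matches the pattern and registers the dataset
print(catalog.list())             # ['companies_data']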

Comment on lines +48 to +51
# Sort data sets by name, then by namespace to display similar data sets together
sorted_data_set_names = sorted(
    data_set_names, key=lambda name: ".".join(reversed(name.split(".")))
)
Contributor:
Suggested change
# Sort data sets by name, then by namespace to display similar data sets together
sorted_data_set_names = sorted(
    data_set_names, key=lambda name: ".".join(reversed(name.split(".")))
)

data_set_names = {
    data_set_name
    for pipeline in pipelines.values()
    for data_set_name in pipeline.data_sets()
Contributor:
Also, just flagging to the team that this method is renamed to pipeline.datasets() on the Kedro develop branch.

@ravi-kumar-pilla commented Oct 18, 2023

Hi @pierre-godard, thank you for the PR. Will it be possible for you to discard the file renames from this PR and make the changes suggested by @ankatiyar?

Also, we access all of our Kedro-Viz data via the DataAccessManager. Can you please move the resolve_dataset_factory_patterns method from server.py to package/kedro_viz/data_access/managers.py?

For your reference, based on the suggestions above:

def resolve_dataset_factory_patterns(
    self, catalog: DataCatalog, pipelines: Dict[str, KedroPipeline]
):
    """Resolve dataset factory patterns in the data catalog by matching
    them against the datasets in the pipelines.
    """
    for pipeline in pipelines.values():
        if hasattr(pipeline, "data_sets"):
            datasets = pipeline.data_sets()  # older Kedro releases
        else:
            datasets = pipeline.datasets()  # renamed on the Kedro develop branch

        for dataset_name in datasets:
            catalog.exists(dataset_name)

Thank you

@ravi-kumar-pilla
Closing this in favor of #1588
