Resolve pipeline datasets matching patterns defined in the catalog #1491

Conversation

@pierre-godard commented Aug 16, 2023

Description

When a tracking dataset is defined in the pipeline outputs but not directly in the data catalog, and it matches a dataset factory pattern defined in the catalog, this dataset is not detected by Kedro-Viz.

This PR fixes this behavior by adding a dataset resolution step after the catalog and pipelines have been loaded.
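For illustration, here is a minimal sketch of what such a resolution step could look like (the function name and import paths below are placeholders for this example rather than the exact code in this PR):

from typing import Dict

from kedro.io import DataCatalog
from kedro.io.core import DatasetNotFoundError  # DataSetNotFoundError on older Kedro releases
from kedro.pipeline import Pipeline


def resolve_pipeline_dataset_patterns(
    catalog: DataCatalog, pipelines: Dict[str, Pipeline]
) -> None:
    """Register catalog entries for pipeline datasets that are not declared
    explicitly but match a dataset factory pattern in the catalog."""
    for pipeline in pipelines.values():
        for dataset_name in pipeline.data_sets():  # renamed to datasets() on newer Kedro
            try:
                # Resolving the name registers the dataset if it matches a pattern.
                catalog._get_dataset(dataset_name)  # pylint: disable=protected-access
            except DatasetNotFoundError:
                continue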

Development notes

QA notes

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added new entries to the RELEASE.md file
  • Added tests to cover my changes

Pierre Godard added 4 commits August 16, 2023 11:27
Pierre Godard and others added 4 commits August 16, 2023 13:55
Comment on lines +48 to +51
# Sort data sets by name, then by namespace to display similar data sets together
sorted_data_set_names = sorted(
    data_set_names, key=lambda name: ".".join(reversed(name.split(".")))
)
Contributor:
This is probably not needed!

Comment on lines +53 to +57
for data_set_name in sorted_data_set_names:
    try:
        catalog._get_dataset(data_set_name)  # pylint: disable=protected-access
    except DatasetNotFoundError:
        continue
Contributor:
Suggested change
for data_set_name in sorted_data_set_names:
    try:
        catalog._get_dataset(data_set_name)  # pylint: disable=protected-access
    except DatasetNotFoundError:
        continue
for dataset in all_datasets:
    catalog.exists(dataset)

Contributor:
This internally calls catalog._get_dataset(dataset) and registers it to the catalog so subsequent catalog.list() calls will have the factory datasets too
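As a rough sketch of that behaviour (the catalog entries below are made up, and this assumes a Kedro version with dataset factory support, roughly 0.18.12 or newer):

from kedro.io import DataCatalog

# A catalog defined only by a dataset factory pattern, with no concrete entries.
catalog = DataCatalog.from_config(
    {
        "{name}_data": {
            "type": "pandas.CSVDataSet",  # needs kedro-datasets installed
            "filepath": "data/01_raw/{name}.csv",
        }
    }
)

print(catalog.list())             # []: the pattern alone is not a registered dataset
catalog.exists("companies_data")  # matches the pattern and registers the dataset
print(catalog.list())             # ['companies_data']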

Comment on lines +48 to +51
# Sort data sets by name, then by namespace to display similar data sets together
sorted_data_set_names = sorted(
    data_set_names, key=lambda name: ".".join(reversed(name.split(".")))
)
Contributor:
Suggested change
# Sort data sets by name, then by namespace to display similar data sets together
sorted_data_set_names = sorted(
    data_set_names, key=lambda name: ".".join(reversed(name.split(".")))
)

data_set_names = {
    data_set_name
    for pipeline in pipelines.values()
    for data_set_name in pipeline.data_sets()
Contributor:
Also, just flagging to the team that this method is renamed to pipeline.datasets() on the Kedro develop branch.

@ravi-kumar-pilla commented Oct 18, 2023

Hi @pierre-godard, thank you for the PR. Will it be possible for you to discard the file renames from this PR and make the changes suggested by @ankatiyar?

Also, we access all of our Kedro-Viz data via the DataAccessManager. Can you please move the resolve_dataset_factory_patterns method from server.py to package/kedro_viz/data_access/managers.py?

For your reference, based on the suggestions above:

def resolve_dataset_factory_patterns(
    self, catalog: DataCatalog, pipelines: Dict[str, KedroPipeline]
):
    """Resolve dataset factory patterns in the data catalog by matching
    them against the datasets in the pipelines.
    """
    for pipeline in pipelines.values():
        if hasattr(pipeline, "data_sets"):
            datasets = pipeline.data_sets()  # older Kedro releases
        else:
            datasets = pipeline.datasets()  # renamed on the Kedro develop branch

        for dataset_name in datasets:
            catalog.exists(dataset_name)

Thank you

@ravi-kumar-pilla
Closing this in favor of #1588
