Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add kedro catalog rank command #2848

Merged
merged 10 commits into from
Jul 27, 2023
Merged

Conversation

AhdraMeraliQB
Copy link
Contributor

Takes over #2796

Description

This PR introduces the command kedro catalog rank which ranks all dataset factories in the catalog config in the order of how they are matched (i.e. if an explicitly named dataset could match several of the factories in the catalog, it would be resolved with the first on the list.)

The priority is determined as such:

  1. Decreasing specificity (number of characters outside the curly brackets)
  2. Decreasing number of placeholders (number of curly bracket pairs)
  3. Alphabetically

Added note on streamlining the terminology

It looks like dataset factory is equiv to factory pattern, catalog factory, dataset pattern, dataset factory pattern and others that have been used interchangeably. As per #2670 any mentions of a catalog entry that makes use of placeholders will be referred to as a dataset factory. Catalog factories refer to all dataset factories within a specific catalog.

What this means in practice / Why this is useful

As seen in the test case, a catalog is created with the following patterns:

  1. an_example_{place}_{holder},
  2. an_example_{placeholder},
  3. an_{example_placeholder},
  4. on_{example_placeholder},

The explicitly declared an_example_data_set could match any of patterns 1, 2, or 3: an_example_{data}_{set}, an_example_{data_set}, an_{example_data_set}. Priority matching means that it will be resolved with the first pattern.
kedro catalog factories allows users to confirm the order in which factory patterns are considered for matching.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
@AhdraMeraliQB AhdraMeraliQB marked this pull request as ready for review July 27, 2023 09:18
@AhdraMeraliQB AhdraMeraliQB removed the request for review from yetudada July 27, 2023 09:18
Copy link
Contributor

@ankatiyar ankatiyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think we should add these CLI commands to the dataset factories docs too (https://docs.kedro.org/en/latest/data/data_catalog.html#load-multiple-datasets-with-similar-configuration-using-dataset-factories) but that can be done after kedro catalog resolve. Tested the implementation, all good! :)

RELEASE.md Outdated Show resolved Hide resolved
kedro/framework/cli/catalog.py Show resolved Hide resolved
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Copy link
Contributor

@ankatiyar ankatiyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! ⭐

@merelcht
Copy link
Member

Were the changes in test_cli not necessary anymore? https://github.com/kedro-org/kedro/pull/2796/files

@AhdraMeraliQB
Copy link
Contributor Author

@merelcht

Were the changes in test_cli not necessary anymore? https://github.com/kedro-org/kedro/pull/2796/files

Without exceeding a command depth of three words the changes are no longer necessary. I didn't include them here as I was unsure if there would be any effect on kedro-telemetry. I think the correct action here would be to resolve this duplication instead of refactoring the test on framework-side, when the code itself isn't used.

@merelcht
Copy link
Member

Were the changes in test_cli not necessary anymore? https://github.com/kedro-org/kedro/pull/2796/files

Without exceeding a command depth of three words the changes are no longer necessary. I didn't include them here as I was unsure if there would be any effect on kedro-telemetry. I think the correct action here would be to resolve this duplication instead of refactoring the test on framework-side, when the code itself isn't used.

Got it! Totally agree that we should remove the duplicated code instead, but of course as part of another PR.

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Thanks for also updating the docs in this PR ⭐

@AhdraMeraliQB AhdraMeraliQB enabled auto-merge (squash) July 27, 2023 13:02
@AhdraMeraliQB AhdraMeraliQB merged commit 9cd8a5b into main Jul 27, 2023
58 of 63 checks passed
@AhdraMeraliQB AhdraMeraliQB deleted the feat/add-kedro-catalog-rank branch July 27, 2023 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add kedro catalog rank command
3 participants