Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Viz compatible with kedro-datasets #1214

Merged
merged 19 commits into from Jan 17, 2023

Conversation

merelcht
Copy link
Member

@merelcht merelcht commented Jan 12, 2023

Description

Fixes #1205

Development notes

  • I've made the matching of import paths more generic so that both datasets from kedro.extras and kedro-datasets get detected correctly
  • Viz will try to import kedro-datasets but if not installed, it will revert to using kedro.extras.datasets

QA notes

I've manually tested this by spinning up the development server as described here: https://github.com/kedro-org/kedro-viz/blob/main/CONTRIBUTING.md#launch-a-development-server-with-a-real-kedro-project

Tried that everything works with kedro==0.18.3 and also with kedro==0.18.4 + kedro-datasets==1.0.1

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added new entries to the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
@merelcht merelcht self-assigned this Jan 12, 2023
Copy link
Contributor

@rashidakanchwala rashidakanchwala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks @merelcht

@tynandebold
Copy link
Member

Don't forget to add a line to the release notes please :)

Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for the fix! Just a few minor suggestions to make things a bit clearer.

Definitely would be nice to get @jmholzer to review also before merging in case he has any other comments since he's also been thinking about this.

@@ -25,7 +25,7 @@ reporting.feature_importance:
versioned: true

reporting.cancellation_policy_grid:
type: demo_project.extras.datasets.image_dataset.ImageDataSet
type: image_dataset.ImageDataSet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason (I think just as an example of how to do it) Joel wanted to define this dataset inside the project itself and not use the kedro built-in one. So we should leave this as it was.

@@ -111,3 +114,9 @@ def load_tracking_data(self, run_id: str):

def get_dataset_type(dataset: AbstractVersionedDataSet) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def get_dataset_type(dataset: AbstractVersionedDataSet) -> str:
def get_full_dataset_type(dataset: AbstractVersionedDataSet) -> str:
"""e.g. kedro.extras.datasets.plotly.plotly_dataset.PlotlyDataSet or kedro_datasets.plotly.plotly_dataset.PlotlyDataSet"""

Maybe? Sort of depends how dataset_type is used elsewhere as to whether it's worth doing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we no longer have this. as get_dataset_module_class is the new 'get_dataset_type' e.g. pandas.csv_dataset.CSVDataset

@@ -67,14 +66,15 @@ class TrackingDatasetModel:

dataset_name: str
# dataset is the actual dataset instance, whereas dataset_type is a string.
# e.g. "kedro.extras.datasets.tracking.metrics_dataset.MetricsDataSet"
# e.g. "kedro_datasets.tracking.metrics_dataset.MetricsDataSet"
dataset: AbstractVersionedDataSet
dataset_type: str = field(init=False)
# runs is a mapping from run_id to loaded data.
runs: Dict[str, Any] = field(init=False, default_factory=dict)

def __post_init__(self):
self.dataset_type = get_dataset_type(self.dataset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this variable actually still used anywhere now in experiment tracking?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I tried removing this which caused errors. I think it's used for the schema in graphql.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect what's happening here is we're passing the full self.dataset_type to the frontend, which is then just being turned back into the abbreviated form - see https://github.com/kedro-org/kedro-viz/pull/1214/files#r1071164866.

The full dataset type is used in the frontend in metadata panel in the flowchart view but not in experiment tracking (I think). So there's a good simplification we could make here if all we really need in experiment tracking is the full dataset type.

Feel free to leave as it is for now, but should be an easy tidy up for someone to do in a follow up PR if I'm right.

Comment on lines 119 to 122
def get_dataset_module_class(dataset: AbstractVersionedDataSet) -> str:
class_name = f"{dataset.__class__.__qualname__}"
_, dataset_type, dataset_file = f"{dataset.__class__.__module__}".rsplit(".", 2)
return f"{dataset_type}.{dataset_file}.{class_name}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming is kind of confusing given we already call the "full path" dataset_type (which is probably already confusing, but it's also in the flowchart part of kedro-viz source, so probably let's leave it that way for now).

Suggested change
def get_dataset_module_class(dataset: AbstractVersionedDataSet) -> str:
class_name = f"{dataset.__class__.__qualname__}"
_, dataset_type, dataset_file = f"{dataset.__class__.__module__}".rsplit(".", 2)
return f"{dataset_type}.{dataset_file}.{class_name}"
def get_abbreviated_dataset_type(dataset: AbstractVersionedDataSet) -> str:
"""e.g. plotly.plotly_dataset.PlotlyDataSet"""
abbreviated_module_name = ".".join(dataset.__class__.__module__.split(".")[-2:])
return f"{abbreviated_module_name}.{dataset.__class__.__qualname__}"

@@ -466,37 +470,39 @@ def __post_init__(self):
self._get_namespace(self.full_name)
)

@staticmethod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd refactor this as per suggestions in experiment_tracking.py.

@jmholzer
Copy link
Contributor

Definitely would be nice to get @jmholzer to review also before merging in case he has any other comments since he's also been thinking about this.

I've read through the code and I definitely think the approach is the right one. I'll add a review today 🙂.

const getShortType = (longTypeName, fallback) =>
shortTypeMapping[longTypeName] || fallback;
const getShortType = (name, fallback) => {
const longTypeName = name?.split('.').slice(-3).join('.');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this basically just doing what your new get_abbreviated_dataset_type does? Why don't we just pass the abbreviated dataset type to the frontend instead now we have it in the backend.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. because nowhere in the front end we show the full name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, i have removed the origin get_dataset_type class and renamed get_dataset_module as get_dataset_type so now everywhere we pass around the shorten dataset name that excludes 'kedro.extras.datasets'

Co-authored-by: Antony Milne <49395058+AntonyMilneQB@users.noreply.github.com>
)


def _get_dataset_type(dataset):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to have a comment here about what this function does, and how and why it is different from the function with the same name in package/kedro_viz/models/flowchart.py, as this may get confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually these two functions are not required in the test files so I have removed it (test_queries, and test _responses). In order to assert the correct result I am going to compare it with the string e.g. "pandas.csv_dataset.CsvDataSet" just the way it was done before.

)


def _get_dataset_type(dataset):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function has been declared twice with the same body (here and in test_queries), I am fine with the DAMP approach, but ditto my comment.

@rashidakanchwala rashidakanchwala merged commit 5d6ce10 into main Jan 17, 2023
@rashidakanchwala rashidakanchwala deleted the fix/make-compatible-with-kedro-datasets branch January 17, 2023 09:21
@tynandebold tynandebold mentioned this pull request Jan 18, 2023
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make Kedro-Viz work with kedro-datasets
5 participants