Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configurable versioning #1871

Closed

Conversation

nickolasrm
Copy link
Contributor

@nickolasrm nickolasrm commented Sep 26, 2022

Description

This PR aims to add more customization for VersionedDataSets. There are three main additions made in this PR, the custom format versioning, the customizable version class, and the partial timestamp parsing.

Motivation

Because Kedro can only versionate datasets using a predefined path, the data history structure generated by a previous code that wasn't using Kedro would require to be unnecessarily refactored. Because of that, I tried another approach using PartitionedDataSets, but its logic is not only hard to maintain but is syntactically different than Kedro's declarative YAML idea. For this reason, I wrote this PR to help turning this need into a feature.

Custom format

The first addition enables the use of format codes in the filepath in order to change the default target path of the versioned file.

cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/car_data/%Y/%m/%d/car_data.csv
  versioned: true

The example above dataset would have been translated to data/01_raw/company/car_data/2022/09/25/car_data.csv if today's date was 2022-09-25

Partial timestamp

In order to simplify loading custom versioned datasets, inputting a not fully filled timestamp has also been implemented.

kedro run --load-version "cars:2022-09-25"

or

catalog.load("cars", "2022-09-25")

This is now a possible way of selecting the load version.

Custom version class

If the custom date format is not enough to implement the versioning logic, then the user can subclass the Version class in order to override the default parse and unparse behaviour of the timestamps. For example, let's say you want to represent the day as the Sunday of the week every time you run the code. For that, you could do something like this:

# settings.py
# sunday_version.py
from kedro.io import Version, ProxyDateTime
from datetime import timedelta


class SundayVersion(Version):
    def tosunday(self, version: ProxyDateTime) -> ProxyDateTime:
        dt = version.datetime
        dt = dt - timedelta((dt.weekday() + 1) % 7)
        return ProxyDateTime.from_datetime(dt)

    def parse(self, version_str: str) -> ProxyDateTime:
        date_time = super().parse(version_str)
        return self.tosunday(date_time)

VERSION_CLASS = SundayVersion

Development notes

Version class

Instead of using Version as a namedtuple, Version is now a complete class that helps to parse and to unparse filepaths, becoming the former part of AbstractDataSet that processes timestamps into paths. This was developed for enabling the custom version manipulation logic.

Kept the original behaviour

The default versioning behaviour was kept using the new auxiliar methods is_custom_format and is_unique_date_format of the AbstractDataSet

UnknownDateTime

This class was implemented because of the mocks ['first', 'second'] in unit tests. I'm not sure if these non-timestamp formats were only designed for testing or if they are actual features. If it is only used for testing, this class and its handling logic in Version._safe_parse method can be removed, but the unit tests may need to be changed.

Custom Version class demands paying attention

Even though a custom Version class can be specified, its parsing, unparsing, and glob methods must be implemented safely in order to not break the internal versioning logic. For instance, the example described before would be considered unique by is_unique_date_format if it implements all ISO format codes. However, because it has changed the %d behaviour, it shouldn't be considered unique. There is a workaround for this problem in the docs, but this is something the user has to pay attention. Also, because unparsing is called multiple times inside the code, the pattern can't be easily manipulated. For example, if the user wants the unparse to always add the date at the end of the filepath the user has to be careful in order to not add it multiple times (because of the internal logic). These are some examples of this setting limitation.

Note: This customization of the datetime logic is very important for the use case I intend to use. I need the exact behaviour of the example, haha.

Unit tests

Wrote unit tests for all kedro.io.date_time classes, and their methods aiming to reproduce their caller's expectations present in other parts of the code.

Wrote unit tests in test_data_catalog for testing new warnings and if the files created by datasets using custom versioning were loading and saving correctly.

None of the already present tests were changed in order to make sure the default behaviour was preserved.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

nickolasrm and others added 12 commits September 20, 2022 20:28
Signed-off-by: nickolasrm <nickolasrochamachado@gmail.com>
Signed-off-by: nickolasrm <nickolasrochamachado@gmail.com>
Signed-off-by: nickolasrm <nickolasrochamachado@gmail.com>
Signed-off-by: nickolasrm <nickolas.machado@ipiranga.ipiranga>
Signed-off-by: nickolasrm <nickolas.machado@ipiranga.ipiranga>
Signed-off-by: nickolasrm <nickolas.machado@ipiranga.ipiranga>
Signed-off-by: nickolasrm <nickolas.machado@ipiranga.ipiranga>
Signed-off-by: nickolasrm <nickolas.machado@ipiranga.ipiranga>
Signed-off-by: nickolasrm <nickolas.machado@ipiranga.ipiranga>
Signed-off-by: nickolasrm <nickolas.machado@ipiranga.ipiranga>
@noklam
Copy link
Contributor

noklam commented Sep 26, 2022

Thank you @nickolasrm for raising this PR, impressive one. This is a big PR and I think it deserves some discussion and possibly breaking down into smaller pieces. I like that Version is potentially a configurable class in settings.py.

This PR is interesting as this is the 1st attempt I see someone trying to subclass Version. At the moment, the version string is quite important as it is also a session_id and needs to be a unique sortable string. We have some discussion about whether we should allow users to customize it #1551.

Do I understand it correctly the motivation of this PR is to make Kedro recognize datasets that were generated outside of Kedro / pre-existing data?

What happens if you are using only YYYY-mm-dd and there are existing datasets, does kedro run still works?

Side-note:
Potentially we may move this into technical design, but before that, I think we need some more clarification and to understand the issue better.

@datajoely
Copy link
Contributor

I just want to add this is great work @nickolasrm we just need to think about how it affects some other moving pieces!

@noklam noklam added the Community Issue/PR opened by the open-source community label Sep 26, 2022
@nickolasrm
Copy link
Contributor Author

nickolasrm commented Sep 26, 2022

Thank you @nickolasrm for raising this PR, impressive one. This is a big PR and I think it deserves some discussion and possibly breaking down into smaller pieces. I like that Version is potentially a configurable class in settings.py.

This PR is interesting as this is the 1st attempt I see someone trying to subclass Version. At the moment, the version string is quite important as it is also a session_id and needs to be a unique sortable string. We have some discussion about whether we should allow users to customize it #1551.

Do I understand it correctly the motivation of this PR is to make Kedro recognize datasets that were generated outside of Kedro / pre-existing data?

What happens if you are using only YYYY-mm-dd and there are existing datasets, does kedro run still works?

Side-note: Potentially we may move this into technical design, but before that, I think we need some more clarification and to understand the issue better.

I'm very happy you liked this change. I find everyone implementing the best mlops practices should use Kedro, but I also think for older projects or for companies that are in the early stages of mlops possibly have a different versioning system than Kedro's. In my case, not only do the old projects depend on the current versioned folder structure but other services connected to this data too.

About the motivation, yes, we want to ingest old data generated by projects that didn't use Kedro at that moment. However, not only reading old data is necessary, but we'd like to keep our folder structure the way it is after rebuilding the project with Kedro. The partitioned dataset approach I mentioned in my first comment was made by prefixing partitions with the custom version format, but as I said, this is not a very maintainable approach.

About the YYYY-mm-dd, the loss of the unique date format is something we can't avoid if the user chooses to do it. Actually, some of our projects require overwriting existing data. For example, when we want to ratify some monthly prediction results. I've added a validation in AbstractVersionedDataSet for verifying when a pattern is not unique, and if it isn't, Kedro won't throw an error, but instead, an overwrite warning.

About the discussion: Very nice to know this was already discussed! I didn't see the thread before, but I think the problem with the session_id may be resolved by decoupling custom Version classes from session_id generation. Maybe I'm suggesting something stupid, but the default Version generator could be something apart from these settings, couldn't it?

@noklam
Copy link
Contributor

noklam commented Oct 6, 2022

Sorry for the delayed response, I missed the notification from GitHub, feel free to @mention again if the thread becomes stale for too long.

A related thread #1551 about session_id.

From my understanding, the version needs to be unique for two reasons.

  1. At the moment, session_id = version - it is not necessary, but a reasonable default as it is. The main usage of the session_id is for experiment tracking
  2. It needs to be a unique stamp to avoid overwriting

@AntonyMilneQB @MerelTheisenQB may have more thought on this?

@merelcht
Copy link
Member

Hi @nickolasrm, thanks for raising this PR! This is quite a big feature, so the maintainer team would like to discuss and dedicate time to this properly. It might take a while for us to get to it, but know that we haven't forgotten this 🙂 I will convert this into an issue (#2355) and close the PR to keep our repo up to date. I hope that's okay!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Community Issue/PR opened by the open-source community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants