Configurable versioning #1871

nickolasrm · 2022-09-26T06:59:15Z

Description

This PR aims to add more customization for VersionedDataSets. There are three main additions made in this PR, the custom format versioning, the customizable version class, and the partial timestamp parsing.

Motivation

Because Kedro can only versionate datasets using a predefined path, the data history structure generated by a previous code that wasn't using Kedro would require to be unnecessarily refactored. Because of that, I tried another approach using PartitionedDataSets, but its logic is not only hard to maintain but is syntactically different than Kedro's declarative YAML idea. For this reason, I wrote this PR to help turning this need into a feature.

Custom format

The first addition enables the use of format codes in the filepath in order to change the default target path of the versioned file.

cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/car_data/%Y/%m/%d/car_data.csv
  versioned: true

The example above dataset would have been translated to data/01_raw/company/car_data/2022/09/25/car_data.csv if today's date was 2022-09-25

Partial timestamp

In order to simplify loading custom versioned datasets, inputting a not fully filled timestamp has also been implemented.

kedro run --load-version "cars:2022-09-25"

or

catalog.load("cars", "2022-09-25")

This is now a possible way of selecting the load version.

Custom version class

If the custom date format is not enough to implement the versioning logic, then the user can subclass the Version class in order to override the default parse and unparse behaviour of the timestamps. For example, let's say you want to represent the day as the Sunday of the week every time you run the code. For that, you could do something like this:

# settings.py
# sunday_version.py
from kedro.io import Version, ProxyDateTime
from datetime import timedelta


class SundayVersion(Version):
    def tosunday(self, version: ProxyDateTime) -> ProxyDateTime:
        dt = version.datetime
        dt = dt - timedelta((dt.weekday() + 1) % 7)
        return ProxyDateTime.from_datetime(dt)

    def parse(self, version_str: str) -> ProxyDateTime:
        date_time = super().parse(version_str)
        return self.tosunday(date_time)

VERSION_CLASS = SundayVersion

Development notes

Version class

Instead of using Version as a namedtuple, Version is now a complete class that helps to parse and to unparse filepaths, becoming the former part of AbstractDataSet that processes timestamps into paths. This was developed for enabling the custom version manipulation logic.

Kept the original behaviour

The default versioning behaviour was kept using the new auxiliar methods is_custom_format and is_unique_date_format of the AbstractDataSet

`UnknownDateTime`

This class was implemented because of the mocks ['first', 'second'] in unit tests. I'm not sure if these non-timestamp formats were only designed for testing or if they are actual features. If it is only used for testing, this class and its handling logic in Version._safe_parse method can be removed, but the unit tests may need to be changed.

Custom `Version` class demands paying attention

Even though a custom Version class can be specified, its parsing, unparsing, and glob methods must be implemented safely in order to not break the internal versioning logic. For instance, the example described before would be considered unique by is_unique_date_format if it implements all ISO format codes. However, because it has changed the %d behaviour, it shouldn't be considered unique. There is a workaround for this problem in the docs, but this is something the user has to pay attention. Also, because unparsing is called multiple times inside the code, the pattern can't be easily manipulated. For example, if the user wants the unparse to always add the date at the end of the filepath the user has to be careful in order to not add it multiple times (because of the internal logic). These are some examples of this setting limitation.

Note: This customization of the datetime logic is very important for the use case I intend to use. I need the exact behaviour of the example, haha.

Unit tests

Wrote unit tests for all kedro.io.date_time classes, and their methods aiming to reproduce their caller's expectations present in other parts of the code.

Wrote unit tests in test_data_catalog for testing new warnings and if the files created by datasets using custom versioning were loading and saving correctly.

None of the already present tests were changed in order to make sure the default behaviour was preserved.

Checklist

Read the contributing guidelines
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes

Signed-off-by: nickolasrm <nickolasrochamachado@gmail.com>

Signed-off-by: nickolasrm <nickolas.machado@ipiranga.ipiranga>

noklam · 2022-09-26T11:17:40Z

Thank you @nickolasrm for raising this PR, impressive one. This is a big PR and I think it deserves some discussion and possibly breaking down into smaller pieces. I like that Version is potentially a configurable class in settings.py.

This PR is interesting as this is the 1st attempt I see someone trying to subclass Version. At the moment, the version string is quite important as it is also a session_id and needs to be a unique sortable string. We have some discussion about whether we should allow users to customize it #1551.

Do I understand it correctly the motivation of this PR is to make Kedro recognize datasets that were generated outside of Kedro / pre-existing data?

What happens if you are using only YYYY-mm-dd and there are existing datasets, does kedro run still works?

Side-note:
Potentially we may move this into technical design, but before that, I think we need some more clarification and to understand the issue better.

datajoely · 2022-09-26T13:54:48Z

I just want to add this is great work @nickolasrm we just need to think about how it affects some other moving pieces!

nickolasrm · 2022-09-26T15:09:42Z

Thank you @nickolasrm for raising this PR, impressive one. This is a big PR and I think it deserves some discussion and possibly breaking down into smaller pieces. I like that Version is potentially a configurable class in settings.py.

This PR is interesting as this is the 1st attempt I see someone trying to subclass Version. At the moment, the version string is quite important as it is also a session_id and needs to be a unique sortable string. We have some discussion about whether we should allow users to customize it #1551.

Do I understand it correctly the motivation of this PR is to make Kedro recognize datasets that were generated outside of Kedro / pre-existing data?

What happens if you are using only YYYY-mm-dd and there are existing datasets, does kedro run still works?

Side-note: Potentially we may move this into technical design, but before that, I think we need some more clarification and to understand the issue better.

I'm very happy you liked this change. I find everyone implementing the best mlops practices should use Kedro, but I also think for older projects or for companies that are in the early stages of mlops possibly have a different versioning system than Kedro's. In my case, not only do the old projects depend on the current versioned folder structure but other services connected to this data too.

About the motivation, yes, we want to ingest old data generated by projects that didn't use Kedro at that moment. However, not only reading old data is necessary, but we'd like to keep our folder structure the way it is after rebuilding the project with Kedro. The partitioned dataset approach I mentioned in my first comment was made by prefixing partitions with the custom version format, but as I said, this is not a very maintainable approach.

About the YYYY-mm-dd, the loss of the unique date format is something we can't avoid if the user chooses to do it. Actually, some of our projects require overwriting existing data. For example, when we want to ratify some monthly prediction results. I've added a validation in AbstractVersionedDataSet for verifying when a pattern is not unique, and if it isn't, Kedro won't throw an error, but instead, an overwrite warning.

About the discussion: Very nice to know this was already discussed! I didn't see the thread before, but I think the problem with the session_id may be resolved by decoupling custom Version classes from session_id generation. Maybe I'm suggesting something stupid, but the default Version generator could be something apart from these settings, couldn't it?

noklam · 2022-10-06T21:07:17Z

Sorry for the delayed response, I missed the notification from GitHub, feel free to @mention again if the thread becomes stale for too long.

A related thread #1551 about session_id.

From my understanding, the version needs to be unique for two reasons.

At the moment, session_id = version - it is not necessary, but a reasonable default as it is. The main usage of the session_id is for experiment tracking
It needs to be a unique stamp to avoid overwriting

@AntonyMilneQB @MerelTheisenQB may have more thought on this?

merelcht · 2023-02-22T16:10:44Z

Hi @nickolasrm, thanks for raising this PR! This is quite a big feature, so the maintainer team would like to discuss and dedicate time to this properly. It might take a while for us to get to it, but know that we haven't forgotten this 🙂 I will convert this into an issue (#2355) and close the PR to keep our repo up to date. I hope that's okay!

nickolasrm and others added 12 commits September 20, 2022 20:28

added base templated versioning

f231ca2

Signed-off-by: nickolasrm <nickolasrochamachado@gmail.com>

added partial datetime parser

cd33a37

Signed-off-by: nickolasrm <nickolasrochamachado@gmail.com>

Added customizable version class

2fd7c10

Signed-off-by: nickolasrm <nickolasrochamachado@gmail.com>

fixed linting and added versioner

c62dfe5

fixed bugs from tests and linting

0cd3561

Signed-off-by: nickolasrm <nickolas.machado@ipiranga.ipiranga>

improved readability of io.core module

c3adb9f

Signed-off-by: nickolasrm <nickolas.machado@ipiranga.ipiranga>

added tests for date_time module and specialized its classes

766f628

Signed-off-by: nickolasrm <nickolas.machado@ipiranga.ipiranga>

added tests for customized versioning and resolved bugs

f0a9da8

Signed-off-by: nickolasrm <nickolas.machado@ipiranga.ipiranga>

added docs for custom versioning

be1ad1d

Signed-off-by: nickolasrm <nickolas.machado@ipiranga.ipiranga>

Merge branch 'kedro-org:main' into feature/configurable_versioned

60a1a25

added custom versioning into release.md

c6b66f3

Signed-off-by: nickolasrm <nickolas.machado@ipiranga.ipiranga>

moved glob from abstractdataset to version

d772900

Signed-off-by: nickolasrm <nickolas.machado@ipiranga.ipiranga>

nickolasrm requested review from yetudada and idanov as code owners September 26, 2022 06:59

noklam added the Community Issue/PR opened by the open-source community label Sep 26, 2022

noklam mentioned this pull request Oct 28, 2022

How can we improve dataset versioning? #1979

Open

merelcht mentioned this pull request Feb 22, 2023

Configurable versioning #2355

Open

merelcht closed this Feb 22, 2023

noklam mentioned this pull request Jun 10, 2024

Conduct market research on versioning #3933

Open

merelcht added this to the Dataset Versioning milestone Jun 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Configurable versioning #1871

Configurable versioning #1871

nickolasrm commented Sep 26, 2022 •

edited

Loading

noklam commented Sep 26, 2022 •

edited

Loading

datajoely commented Sep 26, 2022

nickolasrm commented Sep 26, 2022 •

edited

Loading

noklam commented Oct 6, 2022

merelcht commented Feb 22, 2023

Configurable versioning #1871

Configurable versioning #1871

Conversation

nickolasrm commented Sep 26, 2022 • edited Loading

Description

Motivation

Custom format

Partial timestamp

Custom version class

Development notes

Version class

Kept the original behaviour

UnknownDateTime

Custom Version class demands paying attention

Unit tests

Checklist

noklam commented Sep 26, 2022 • edited Loading

datajoely commented Sep 26, 2022

nickolasrm commented Sep 26, 2022 • edited Loading

noklam commented Oct 6, 2022

merelcht commented Feb 22, 2023

nickolasrm commented Sep 26, 2022 •

edited

Loading

`UnknownDateTime`

Custom `Version` class demands paying attention

noklam commented Sep 26, 2022 •

edited

Loading

nickolasrm commented Sep 26, 2022 •

edited

Loading