Easier CustomDataset Creation #1936

Open
lancechua opened this issue Oct 16, 2022 · 22 comments

Comments

@lancechua

Description

IO read write functions typically follow this signature:

from pathlib import Path
from typing import Any, Union

def load(path: Union[Path, str], **kwargs) -> Any:
    ...

def save(obj: Any, path: Union[Path, str], **kwargs) -> None:
    ...

Creating custom datasets should ideally be as easy as supplying load / save function(s) that follow this function signature.

Context

kedro supports an extensive range of datasets, but it is not exhaustive. Popular libraries used by relatively niche communities like xarray and arviz aren't currently supported.

Beyond these examples, unofficially adding support for more obscure datasets would be easier.

Initially, I was looking to implement something like this and asked in the Discord chat if this pattern made sense.
Then, @datajoely suggested I open a feature request.

Possible Implementation

We can consider a Dataset class factory. Maybe GenericDataset with a class method .create_custom_dataset(name: str, load: callable, save: callable).

Usage would look something like xarrayNetCDF = GenericDataset.create_custom_dataset("xarrayNetCDF", xarray.open_dataset, lambda x, path, **kwargs: x.to_netcdf(path, **kwargs)).
Entries can be added to the data catalog yaml just as with any other custom dataset implementation.
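
A minimal sketch of what such a class factory could look like, using only the standard library. GenericDataset and create_custom_dataset are the hypothetical names from the proposal above, not existing Kedro APIs, and the JSON functions are stand-ins for any load / save pair with the signature described earlier.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, Optional


class GenericDataset:
    """Hypothetical base class for illustration; not an existing Kedro class."""

    def __init__(
        self,
        filepath: str,
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._filepath = Path(filepath)
        self._load_args = load_args or {}
        self._save_args = save_args or {}

    def load(self) -> Any:
        # Delegates to the injected function: load(path, **kwargs) -> obj
        return self._load_func(self._filepath, **self._load_args)

    def save(self, data: Any) -> None:
        # Delegates to the injected function: save(obj, path, **kwargs) -> None
        self._save_func(data, self._filepath, **self._save_args)

    @classmethod
    def create_custom_dataset(
        cls, name: str, load: Callable[..., Any], save: Callable[..., None]
    ) -> type:
        # staticmethod() stops Python from binding `self` to the injected functions.
        namespace = {"_load_func": staticmethod(load), "_save_func": staticmethod(save)}
        return type(name, (cls,), namespace)


def load_json(path, **kwargs):
    with open(path) as f:
        return json.load(f, **kwargs)


def save_json(obj, path, **kwargs):
    with open(path, "w") as f:
        json.dump(obj, f, **kwargs)


JSONDataset = GenericDataset.create_custom_dataset("JSONDataset", load_json, save_json)

with tempfile.TemporaryDirectory() as tmp:
    ds = JSONDataset(str(Path(tmp) / "data.json"), save_args={"indent": 2})
    ds.save({"a": 1})
    roundtrip = ds.load()
```

Because the factory returns an ordinary class, it could in principle be referenced from the catalog by its import path like any other custom dataset, provided it is assigned to a module-level name.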

Possible Alternatives

  • LambdaDataset is very similar, but load and save are hard-coded in the implementation and cannot be parameterized in the data catalog, as far as I'm aware
  • Subclassing AbstractDataset is an option, but this feature request seeks to reduce boilerplate when defining new datasets
  • Adding xarray support #1346 officially requires implementing nuances like cloud file storage, partitioned datasets, lazy loading, etc.
@lancechua lancechua added the Issue: Feature Request New feature or improvement to existing feature label Oct 16, 2022
@merelcht merelcht added the Community Issue/PR opened by the open-source community label Oct 17, 2022
@datajoely
Contributor

Hi @lancechua thank you for raising this! We have discussed this as a team and have had a good conversation regarding ways we can proceed.

You will see that this issue has now been referenced in the #1778 super issue which is tracking all of the ways we plan to improve the DataCatalog at a low level.

Your last point:

Adding xarray support kedro-org/kedro-plugins#165 officially requires implementing nuances like cloud file storage, partitioned datasets, lazy loading, etc.

is the most interesting: a lot of the pain points you mention have emerged organically. Versioning, fsspec, etc. all arrived in Kedro much later than the original AbstractDataSet class definition, so implementing a new DataSet today requires (probably too much) copying and pasting.

The team are going to have a go at prototyping what this could look like, so you may be interested in subscribing to the parent issue to contribute to these design decisions. The initial thinking reflected on the rather flat structure we have today, where everything inherits from AbstractDataSet (or AbstractVersionedDataSet); we could utilise OO concepts much better and make implementation/testing/overriding much simpler.

@deepyaman
Member

deepyaman commented Oct 31, 2022

is the most interesting: a lot of the pain points you mention have emerged organically. Versioning, fsspec, etc. all arrived in Kedro much later than the original AbstractDataSet class definition, so implementing a new DataSet today requires (probably too much) copying and pasting.

I feel like too much copy-pasting has been an issue for a really long time. 😅

I would be keen to break down the points of duplication (e.g. default argument handling, fsspec-related boilerplate, etc.) and add mixins for each (as in #15), because that way we can make progress on resolving these pain points without having to limit ourselves to a universal solution that works for every possible dataset.
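
As an illustration of one such mixin, here is a rough sketch covering only default argument handling. DefaultArgsMixin, _init_args and ToyCSVDataset are hypothetical names made up for this example; they are not taken from Kedro or from the linked PR.

```python
from copy import deepcopy
from typing import Any, Dict, Optional


class DefaultArgsMixin:
    """Hypothetical mixin: merges user-supplied arguments over class-level defaults."""

    DEFAULT_LOAD_ARGS: Dict[str, Any] = {}
    DEFAULT_SAVE_ARGS: Dict[str, Any] = {}

    def _init_args(
        self,
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        # Copy the defaults so that instances never mutate the class attributes.
        self._load_args = {**deepcopy(self.DEFAULT_LOAD_ARGS), **(load_args or {})}
        self._save_args = {**deepcopy(self.DEFAULT_SAVE_ARGS), **(save_args or {})}


class ToyCSVDataset(DefaultArgsMixin):
    # Only the defaults are declared here; the merging logic lives in the mixin.
    DEFAULT_SAVE_ARGS = {"index": False}

    def __init__(self, filepath, load_args=None, save_args=None):
        self._filepath = filepath
        self._init_args(load_args, save_args)


ds = ToyCSVDataset("data.csv", save_args={"sep": ";"})
merged_save_args = ds._save_args
```

Each dataset then declares only its defaults, and the same pattern could be repeated for other points of duplication independently of one another.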

@merelcht merelcht added Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation and removed Community Issue/PR opened by the open-source community labels Apr 12, 2023
@astrojuanlu
Member

Collecting some thoughts from our latest Tech Design session about this topic:

Recent developments

General thoughts

Current issues

  • Lots of copy-pasting
  • Unclear how to support "lazy" datasets
    • Case in point: HDF, FITS, others
    • Datasets are eager, so if the data has to be lazily loaded, users have to jump through some hoops
  • Somewhat unclear separation of "format" and "transport"
    • Not really an issue of the current design, but we could do a better job at communicating it, or just assume that people will do whatever works for them
  • Unclear if full reproducibility is encouraged
    • Case in point: Kaggle
    • Another: "My experience loading COVID data https://github.com/deepyaman/inauditus/blob/develop/refresh-data I just write a script to download the data from an external source, and then treat it as a pandas.CSVDataSet. There are other problems with doing the “expensive” data download in a load function, e.g. that we’re loading from a remote source each time when we really should just have it in a downloaded source."
  • Unclear interaction with versioning
    • "And it is more subtle about versioning - it actually works only if you use a specific function. The newly ManagedTableDataset which use the built-in versioning actually doesn’t work like this."
  • Creating a custom dataset requires a full Kedro installation, which some users consider too heavy
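
To make the "lazy" point in the list above concrete, one of the hoops users can jump through is having load return a zero-argument callable instead of the data itself. The sketch below assumes a hypothetical LazyDataset wrapper and a fake expensive_read function; neither is a Kedro API.

```python
from functools import partial
from typing import Any, Callable, List


class LazyDataset:
    """Hypothetical wrapper: load() returns a callable, so nothing is read up front."""

    def __init__(self, load_func: Callable[..., Any], *args: Any, **kwargs: Any) -> None:
        self._deferred = partial(load_func, *args, **kwargs)

    def load(self) -> Callable[[], Any]:
        return self._deferred


calls: List[str] = []


def expensive_read(path: str) -> str:
    # Stand-in for opening a large HDF or FITS file.
    calls.append(path)
    return f"contents of {path}"


ds = LazyDataset(expensive_read, "big_file.h5")
handle = ds.load()          # nothing has been read yet
calls_before = list(calls)
data = handle()             # the read happens here, inside the node
```

The cost of this workaround is that every node consuming the dataset has to know it receives a callable rather than the data.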

Other links and ideas

@astrojuanlu
Member

A user that quite likely got confused when defining their own dataset, most likely because they aren't copy-pasting enough https://www.linen.dev/s/kedro/t/12646523/hi-folks-i-m-trying-to-create-a-custom-kedro-dataset-inherit#ac22d4cb-dd72-4072-905a-f9c5cb352205

@yetudada
Contributor

yetudada commented Aug 7, 2023

I'm also just dropping data from some of our surveys here. The question is "What is the hardest part of working with the Kedro framework?" (Kedro - User Research Study)

  • IO abstraction, I found its bit tricky if I want to implement any new IO module
  • when I have to work on non-popular data sets such as Excel
  • Writing custom IO datasets. Specially Incremental Data Ingestion. Finding rationale behind different kedro layer (raw, stage, int etc) specifically when to use which and when to skip.
  • Custom datasets perhaps

@astrojuanlu
Member

Interesting use case: a user trying to perform data access with a highly customized wget command https://linen-slack.kedro.org/t/14156294/i-m-migrating-to-kedro-for-a-bunch-of-my-projects-for-one-of#a3d857e1-7e31-48f9-98de-439a936786c8

This is not too different from my KaggleDataset attempt. Maybe there should be a "generic dataset" that allows people to just type a subprocess that will take care of leaving some files in a directory.
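
A rough sketch of what that "generic dataset" might look like, assuming a hypothetical SubprocessDataset whose load runs a command and then returns whatever files the command left in a directory. The Python one-liner stands in for a customized wget or Kaggle CLI call.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Sequence


class SubprocessDataset:
    """Hypothetical: load() runs a command, then lists the files it left behind."""

    def __init__(self, command: Sequence[str], output_dir: str) -> None:
        self._command = list(command)
        self._output_dir = Path(output_dir)

    def load(self) -> List[Path]:
        self._output_dir.mkdir(parents=True, exist_ok=True)
        # check=True turns a non-zero exit code into CalledProcessError.
        subprocess.run(self._command, cwd=self._output_dir, check=True)
        return sorted(p for p in self._output_dir.iterdir() if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a download command: write one small file into the working directory.
    cmd = [sys.executable, "-c", "open('data.txt', 'w').write('hello')"]
    files = SubprocessDataset(cmd, tmp).load()
    names = [p.name for p in files]
    content = files[0].read_text()
```

Note that this re-runs the command on every load, which is exactly the repeated expensive download concern raised earlier in the thread.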

@datajoely
Contributor

Channelling my inner @idanov: this does harm reproducibility. Personally I think it's a good thing; the same goes for conditional nodes, another thing we've held off doing for the same reason.

@astrojuanlu
Member

This is a bit of a tangent, but I think it's worth addressing.

The reality is that people who need "weird" things will do them, whether they are supported by Kedro or not. This applies not only to datasets: I've seen people do very weird things with Kedro.

Speaking of datasets, the options our users have are:

  • Put the "non reproducible" part in a separate step, and build the rest as a Kedro project. (What @deepyaman did in his project, and what the user I referred to in my previous comment declared they would do.) I contend this is bad DX.
  • Hack around their own brittle dataset that does this, which is a difficult thing to do in any case (see this issue). So not only is this bad DX, but they're also probably shooting themselves in the foot and we're not helping.
  • Find an alternative way, assuming it's possible. This is good for our users' enlightenment, but still screams "you're doing it wrong", which is something that should be handled with care from the Kedro side (with, at the very least, good docs that are easily discoverable). I'm on the fence on whether this is bad DX or not - frameworks definitely impose some constraints. But should they have escape hatches?
  • Say "Kedro is not for me" and try something else. This is bad for Kedro.

There are 2 stances we can take:

  1. Consider the bad DX a feature. Friction means the users will try to do something else. Or, put another way, what they're trying to do is out of scope for Kedro so we refuse to implement it ourselves.
  2. Consider the bad DX a bug. Friction means we should improve the situation for these users, while helping them do the right thing as much as we can, but offering options to do all their data processing within Kedro.

I'm firmly in camp (2).

@datajoely
Contributor

Completely aligned on (2). I think it's related to the wider rigidity questions we have in the framework vs lib debate.

FYI - how Prefect does conditionals here looks neat
PrefectHQ/prefect#3545 (comment)

@astrojuanlu
Member

Another user who wants to read a TSV file contained in a .tar.gz https://stackoverflow.com/q/76994906/554319

pandas does not support compressed archives with more than one file, so this requires an intermediate step.
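
A sketch of that intermediate step using only the standard library: open the archive, pull out the one member of interest in memory, and parse it as tab-separated text. The helper name read_tsv_from_tar and the file names are made up for this example; the extracted file object could presumably be handed to pandas.read_csv with sep="\t" instead of csv.DictReader.

```python
import csv
import io
import tarfile
import tempfile
from pathlib import Path
from typing import Dict, List


def read_tsv_from_tar(archive_path, member_name: str) -> List[Dict[str, str]]:
    """Extract one member of a .tar.gz in memory and parse it as tab-separated rows."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        member = archive.extractfile(member_name)
        if member is None:
            # extractfile() returns None for members that are not regular files.
            raise FileNotFoundError(f"{member_name} is not a regular file in the archive")
        text = io.TextIOWrapper(member, encoding="utf-8", newline="")
        return list(csv.DictReader(text, delimiter="\t"))


with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "bundle.tar.gz"
    # Build a small archive with two files, the case pandas cannot read directly.
    with tarfile.open(archive_path, mode="w:gz") as archive:
        for name, body in [("data.tsv", "id\tname\n1\tada\n"), ("README.txt", "extra file\n")]:
            payload = body.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    rows = read_tsv_from_tar(archive_path, "data.tsv")
```

Wrapped in a custom dataset, the archive path and member name would become catalog parameters.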

@astrojuanlu
Member

astrojuanlu commented Sep 13, 2023

After implementing some new datasets that will hopefully be open sourced soon, I'm adding some more reflections here.

I think that the difficult part lies in three specific points:

  1. Having to overload private methods of an obscure class (that then don't get documented, see any dataset in https://docs.kedro.org/en/0.18.13/kedro_datasets.html),
  2. The fsspec boilerplate, and
  3. The versioning stuff (How can we improve dataset versioning? #1979, Configurable versioning #2355)

(1) was hinted by @lancechua's original comment at the top of this issue, and I agree that we should move towards an architecture that makes load and save public methods. Moreover, I'm advocating to favor composition over inheritance.

(2) and (3) are trickier though. What's interesting is that one can envision datasets that don't refer to specific files on disk and that are not versioned - and these ones are actually quite easy to implement. As such, adding the path to load as @lancechua suggested might be too strong an assumption.

Another reason why this is hard is the naming itself: some things we name "datasets" in Kedro are not really a /ˈdā-tə-ˌset/ in the strict sense of the word, but more like "artifacts" or "connectors". I think there is an opportunity here to collect some of the "weird" use cases we already have and come up with better terminology.

@lancechua
Author

lancechua commented Sep 16, 2023

(1) was hinted by @lancechua's original comment at the top of this issue, and I agree that we should move towards an architecture that makes load and save public methods. Moreover, I'm advocating to favor composition over inheritance.

Can't remember who linked it in the issues here somewhere, but after reading the Python subclassing redux article, I also think composition is the better way to go by injecting load and save callables as dependencies.

(2) and (3) are trickier though. What's interesting is that one can envision datasets that don't refer to specific files on disk and that are not versioned - and these ones are actually quite easy to implement. As such, adding the path to load as @lancechua suggested might be too strong an assumption.

Also agree. The callables can be fully specified by load_kwargs (and potentially load_args) so there isn't a hard requirement for the function signature.
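
A small sketch of that idea, assuming a hypothetical ComposedDataset that receives its load and save callables as dependencies and passes them nothing but the configured kwargs. The in-memory store shows a "pathless" dataset; none of these names come from Kedro.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ComposedDataset:
    """Hypothetical: behaviour is injected, and no path argument is assumed."""

    load_func: Callable[..., Any]
    save_func: Callable[..., None]
    load_kwargs: Dict[str, Any] = field(default_factory=dict)
    save_kwargs: Dict[str, Any] = field(default_factory=dict)

    def load(self) -> Any:
        # The callable is fully specified by its kwargs.
        return self.load_func(**self.load_kwargs)

    def save(self, data: Any) -> None:
        self.save_func(data, **self.save_kwargs)


# A "pathless" example: a dict plays the role of a remote key-value store.
store: Dict[str, Any] = {}

ds = ComposedDataset(
    load_func=lambda key: store[key],
    save_func=lambda data, key: store.__setitem__(key, data),
    load_kwargs={"key": "greeting"},
    save_kwargs={"key": "greeting"},
)
ds.save("hello")
loaded = ds.load()
```

A file-based dataset would simply put its path inside load_kwargs and save_kwargs.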

From what I remember about the implementation of versioning, the filename becomes a directory, with timestamp sub-directories. A separate versioner dependency can be injected to redirect the load and save callables to the path of the correct version. This would make the path parameter a hard requirement, but we can always just pass None for "pathless" datasets.
The no versioning case would just return the path as is.
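
Roughly, the injected versioner could look like the sketch below. NoVersioner and TimestampVersioner are hypothetical names, and the directory layout (file name as a directory, one sub-directory per timestamp) follows my recollection above rather than a check against the Kedro source.

```python
import tempfile
from pathlib import Path
from typing import Optional


class NoVersioner:
    """No versioning: return the path as is (None stays None for pathless datasets)."""

    def resolve_load(self, path: Optional[Path]) -> Optional[Path]:
        return path

    def resolve_save(self, path: Optional[Path]) -> Optional[Path]:
        return path


class TimestampVersioner:
    """Hypothetical: <path>/<version>/<filename>, loading the latest version."""

    def __init__(self, save_version: str) -> None:
        self.save_version = save_version

    def resolve_save(self, path: Path) -> Path:
        return path / self.save_version / path.name

    def resolve_load(self, path: Path) -> Path:
        # Timestamps in this format sort lexicographically, so the max is the latest.
        versions = sorted(p.name for p in path.iterdir() if p.is_dir())
        if not versions:
            raise FileNotFoundError(f"No versions found under {path}")
        return path / versions[-1] / path.name


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "data.txt"
    for version, text in [("2023-09-16T10.00.00", "old"), ("2023-09-16T11.00.00", "new")]:
        target = TimestampVersioner(version).resolve_save(base)
        target.parent.mkdir(parents=True)
        target.write_text(text)
    latest = TimestampVersioner("unused").resolve_load(base)
    latest_text = latest.read_text()
    latest_version = latest.parent.name
    unversioned_matches = NoVersioner().resolve_load(base) == base
```

The dataset would call resolve_load or resolve_save first and pass the resulting path to its load or save callable.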

A use case that would be good, but potentially annoying to implement, is a cache for expensive "load" operations like web pulls or API requests. Instead of repeating the expensive request, the flow with subsequent reads would be:

expensive first request -> in memory -> save to disk -> read cache from disk -> in memory -> ...

The cache can be a dispatcher that takes both the original load callable and a load cache callable.
The cache itself must still be a valid load callable.
Perhaps we namespace the original and cache load params, or create a separate load_cache_kwargs parameter?

The load callable dispatched would depend on:

  • dataset parameters (e.g. use_cache=True)
  • cache validity (e.g. based on hash of params).

The no caching case would just always dispatch to the original load callable.
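
A sketch of such a dispatcher, under a few assumptions of mine: the cache key is a hash of the load kwargs (so they must be representable via json.dumps, with repr as a fallback), and the caller supplies the load_cache / save_cache callables. make_cached_load and expensive_request are hypothetical names.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, List


def make_cached_load(
    load: Callable[..., Any],
    load_cache: Callable[[Path], Any],
    save_cache: Callable[[Any, Path], Any],
    cache_dir: str,
    use_cache: bool = True,
) -> Callable[..., Any]:
    """Return a callable with the same kwargs interface as `load`."""

    def cached_load(**load_kwargs: Any) -> Any:
        if not use_cache:
            # No caching: always dispatch to the original load callable.
            return load(**load_kwargs)
        # Cache validity is based on a hash of the parameters.
        key = json.dumps(load_kwargs, sort_keys=True, default=repr).encode("utf-8")
        cache_file = Path(cache_dir) / hashlib.sha256(key).hexdigest()
        if cache_file.exists():
            return load_cache(cache_file)
        data = load(**load_kwargs)
        save_cache(data, cache_file)
        return data

    return cached_load


calls: List[str] = []


def expensive_request(url: str) -> dict:
    # Stand-in for a web pull or API request.
    calls.append(url)
    return {"url": url, "rows": [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp:
    load = make_cached_load(
        expensive_request,
        load_cache=lambda path: json.loads(path.read_text()),
        save_cache=lambda data, path: path.write_text(json.dumps(data)),
        cache_dir=tmp,
    )
    first = load(url="https://example.com/a")
    second = load(url="https://example.com/a")   # served from the cache on disk
    third = load(url="https://example.com/b")    # different params, new request
```

Cache expiry is not handled in this sketch.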

P.S. Haven't worked on kedro datasets for a while now, but hopefully the above still makes sense.

@astrojuanlu
Member

Thanks a lot for getting back @lancechua! Yes, Hynek has influenced me a lot on this as well. Hope we can prioritize this in 2024.

@astrojuanlu
Member

More evidence of different artifact types: Kubeflow https://www.kubeflow.org/docs/components/pipelines/v2/data-types/artifacts/#artifact-types (via @deepyaman)

@astrojuanlu
Member

Another problem with the current design: when doing post-mortem debugging on datasets, one gets thrown to AbstractDataset.load:

kedro/kedro/io/core.py

Lines 191 to 201 in 4d186ef

try:
    return self._load()
except DatasetError:
    raise
except Exception as exc:
    # This exception handling is by design as the composed data sets
    # can throw any type of exception.
    message = (
        f"Failed while loading data from data set {str(self)}.\n{str(exc)}"
    )
    raise DatasetError(message) from exc
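
A possible workaround while the design stays as it is: because the wrapper uses raise ... from exc, the original exception remains reachable through __cause__. The sketch below uses a toy DatasetError and ToyDataset as stand-ins for the Kedro classes, so it runs without Kedro installed.

```python
import traceback


class DatasetError(Exception):
    """Stand-in for Kedro's exception of the same name."""


class ToyDataset:
    def _load(self):
        return {}["missing"]  # raises KeyError, the "real" failure

    def load(self):
        # Same wrapping pattern as the excerpt above.
        try:
            return self._load()
        except DatasetError:
            raise
        except Exception as exc:
            raise DatasetError(f"Failed while loading data.\n{exc}") from exc


try:
    ToyDataset().load()
except DatasetError as err:
    original = err.__cause__
    # The cause keeps its own traceback, which ends in the frame that actually failed.
    innermost = traceback.extract_tb(original.__traceback__)[-1].name
```

Passing original.__traceback__ to pdb.post_mortem should then land the debugger in _load rather than in the wrapper.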

@datajoely
Contributor

Now that Python 3.7 is EOL, is there any advantage to using Protocol types as an alternative/addition to the abstract classes?
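
For reference, a minimal sketch of what a Protocol-based interface could look like with typing.Protocol (available from Python 3.8). DatasetProtocol and the toy classes are hypothetical; nothing here reflects how the catalog currently checks datasets.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DatasetProtocol(Protocol):
    """Hypothetical structural interface: anything with load() and save() qualifies."""

    def load(self) -> Any: ...

    def save(self, data: Any) -> None: ...


class InMemoryThing:
    # No inheritance from any Kedro (or other) base class.
    def __init__(self) -> None:
        self._data = None

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


class LoadOnly:
    def load(self) -> Any:
        return 1


def round_trip(dataset: DatasetProtocol, data: Any) -> Any:
    dataset.save(data)
    return dataset.load()


conforms = isinstance(InMemoryThing(), DatasetProtocol)
rejects = not isinstance(LoadOnly(), DatasetProtocol)
result = round_trip(InMemoryThing(), [1, 2])
```

The runtime isinstance check only verifies that the methods exist, not their signatures, so most of the benefit would come from static type checking.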

@astrojuanlu
Member

I proposed it in another issue and @antonymilne made some good remarks in #2409 (comment). I could be wrong here, and in any case I defer the implementation details to the engineering team, but I don't think this can be done without "breaking" the current catalog + abstractdataset model. This would essentially be a "catalog 2.0", which is exciting but also kind of scary.

@astrojuanlu
Member

@datajoely
Contributor

On that last part - we're well overdue a native PyTorch dataset

@astrojuanlu
Member

Another thing to note: some of our underlying datasets have their own mechanisms of reading remote data that don't rely on fsspec (pola-rs/polars#13044). On top of that, our "copy paste" fsspec logic is sometimes really obscure and easy to get wrong (kedro-org/kedro-plugins#360 (comment)).

Making our "fsspec adapter" (not a mixin please) more explicit will make life easier for custom dataset authors and probably for us.

@astrojuanlu
Member

Progress towards making load and save public:

Edit: Actually, thinking about it more, should just make load and save abstract instead of _load and _save, to make the new interface clear. Even if just _load and _save is being implemented, you mentioned it's going into __init_subclass__, and load and save will get created there.

Originally posted by @deepyaman in #3920 (comment)
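
To illustrate the mechanism being discussed, here is a simplified sketch of how __init_subclass__ can create public load and save methods from private _load and _save. This is an assumption-laden toy, not the implementation from the linked PR: BaseDataset, _make_public and LegacyDataset are made-up names, and DatasetError is a local stand-in.

```python
class DatasetError(Exception):
    """Stand-in for Kedro's exception of the same name."""


def _make_public(private_func, public_name):
    def wrapper(self, *args, **kwargs):
        try:
            return private_func(self, *args, **kwargs)
        except DatasetError:
            raise
        except Exception as exc:
            raise DatasetError(f"{public_name} failed on {type(self).__name__}: {exc}") from exc

    wrapper.__name__ = public_name
    return wrapper


class BaseDataset:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        for private, public in (("_load", "load"), ("_save", "save")):
            # Only generate the public method when the subclass defines the private one.
            if private in cls.__dict__ and public not in cls.__dict__:
                setattr(cls, public, _make_public(cls.__dict__[private], public))


class LegacyDataset(BaseDataset):
    """Written against the old interface: only _load and _save are implemented."""

    def __init__(self):
        self._data = None

    def _load(self):
        return self._data

    def _save(self, data):
        self._data = data


ds = LegacyDataset()
ds.save(42)
value = ds.load()
```

Existing datasets keep working unchanged, while new ones could define load and save directly.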

@astrojuanlu
Member

Also, interesting perspective on object stores vs filesystems: https://docs.rs/object_store/latest/object_store/#why-not-a-filesystem-interface

This design provides the following advantages:

  • All operations are atomic, and readers cannot observe partial and/or failed writes
  • Methods map directly to object store APIs, providing both efficiency and predictability
  • Abstracts away filesystem and operating system specific quirks, ensuring portability
  • Allows for functionality not native to filesystems, such as operation preconditions and atomic multipart uploads

Originally shared by @MatthiasRoels in kedro-org/kedro-plugins#625 (comment)
