Easier CustomDataset Creation #1936
Comments
Hi @lancechua, thank you for raising this! We have discussed this as a team and had a good conversation about ways we can proceed. You will see that this issue has now been referenced in the #1778 super issue, which is tracking all of the ways we plan to improve the dataset experience.

Your last point is the most interesting: a lot of the pain points mentioned have emerged organically. Versioning, fsspec etc. all arrived in Kedro well after the original design. The team are going to have a go at prototyping what this could look like, so you may be interested in subscribing to the parent issue to contribute to these design decisions. The initial thinking reflected on the rather flat structure we have today, where everything inherits from `AbstractDataset`.
I feel like too much copy-pasting has been an issue for a really long time. 😅 I would be keen to break down the points of duplication (e.g. default argument handling, …).
Collecting some thoughts from our latest Tech Design session about this topic:

- Recent developments
- General thoughts
- Current issues
- Other links and ideas
A user who quite likely got confused when defining their own dataset, most likely because they aren't copy-pasting enough: https://www.linen.dev/s/kedro/t/12646523/hi-folks-i-m-trying-to-create-a-custom-kedro-dataset-inherit#ac22d4cb-dd72-4072-905a-f9c5cb352205
I'm also just dropping data from some of our surveys here; the question was "What is the hardest part of working with the Kedro framework?" (Kedro - User Research Study)
Interesting use case: a user trying to perform data access with a highly customized dataset. This is not too different from my earlier comment.
Channelling my inner @idanov: this does harm reproducibility. Personally I think it's a good thing; the same goes for conditional nodes, another thing we've held off doing for the same reason.
This is a bit of a tangent, but I think it's worth addressing. The reality is that people who need "weird" things will do them, whether they are supported by Kedro or not. This doesn't only apply to datasets: I've seen people do very weird things with Kedro. Speaking of datasets, the options our users have are:

There are 2 stances we can take:

I'm firmly in camp (2).
Completely aligned on (2). I think it's related to the wider rigidity questions we have in the framework-vs-library debate. FYI: how Prefect does conditionals here looks neat.
Another user who wants to read a TSV file contained in a compressed archive: pandas does not support compressed archives with more than one file, so this requires an intermediate step.
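For illustration, that intermediate step might look like the following standard-library-only sketch (the archive layout and member names here are made up): open the archive, pick out the wanted member, then parse it as TSV.

```python
import csv
import io
import zipfile


def read_tsv_from_zip(zip_path: str, member: str) -> list:
    """Extract a single TSV member from a (possibly multi-file) zip archive.

    pandas can transparently decompress a single-file archive, but an
    archive with several members needs this explicit intermediate step.
    """
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as fh:
            # Wrap the binary member handle so csv can read text from it.
            text = io.TextIOWrapper(fh, encoding="utf-8")
            return list(csv.DictReader(text, delimiter="\t"))
```

A real custom dataset would wrap this logic in its `load` method so catalog users never see the extraction step.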
After implementing some new datasets that will hopefully be open sourced soon, I'm adding some more reflections here. I think that the difficult part lies in three specific points:

(1) was hinted at by @lancechua's original comment at the top of this issue, and I agree that we should move towards an architecture that makes this easier.

(2) and (3) are trickier though. What's interesting is that one can envision datasets that don't refer to specific files on disk and that are not versioned, and these ones are actually quite easy to implement.

Another reason why this is hard is the naming itself: some things we name "datasets" in Kedro are not really a /ˈdā-tə-ˌset/ in the strict sense of the word, but more like "artifacts" or "connectors". I think there is an opportunity here to collect some of the "weird" use cases we already have and come up with better terminology.
Can't remember who linked it somewhere in the issues here, but after reading the Python subclassing redux article, I also think composition is the better way to go, by injecting callables:
Also agree. The callables can be fully specified by the user.

From what I remember about the implementation of versioning, the filename becomes a directory, with timestamp sub-directories; a separate wrapper could layer versioning on top.

A use case that would be good, but potentially annoying to implement, is a caching layer. The no-caching case would just always dispatch to the original callable.

P.S. Haven't worked on kedro datasets for a while now, but hopefully the above still makes sense.
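As a rough sketch of the composition idea discussed above (all names here are hypothetical, not Kedro's API): `load`/`save` are injected callables, and caching is an optional wrapper layered on at construction time, so the no-caching case dispatches straight to the original callable.

```python
import functools
from typing import Any, Callable

Load = Callable[[], Any]
Save = Callable[[Any], None]


def with_cache(load: Load) -> Load:
    """Wrap a zero-argument load callable so repeat calls reuse the result."""
    return functools.lru_cache(maxsize=1)(load)


class ComposedDataset:
    """Dataset behaviour built by injecting callables, not by subclassing.

    Optional concerns (caching here; versioning could work the same way)
    are wrappers applied to the injected callables.
    """

    def __init__(self, load: Load, save: Save, cached: bool = False):
        # No caching: dispatch straight to the original callable.
        self._load = with_cache(load) if cached else load
        self._save = save

    def load(self) -> Any:
        return self._load()

    def save(self, data: Any) -> None:
        self._save(data)
```

The design point is that each concern stays a small orthogonal wrapper, rather than another layer in an inheritance tree.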
Thanks a lot for getting back, @lancechua! Yes, Hynek has influenced me a lot on this as well. Hope we can prioritize this in 2024.
More evidence of different artifact types: Kubeflow https://www.kubeflow.org/docs/components/pipelines/v2/data-types/artifacts/#artifact-types (via @deepyaman)
Another problem with the current design: when doing post-mortem debugging on datasets, one gets thrown to lines 191 to 201 in 4d186ef.
Now that Python 3.7 is EOL, is there any advantage to using Protocol types as an alternative (or addition) to the abstract classes?
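As a sketch of what that could look like (hypothetical names, not Kedro's actual API): with `typing.Protocol`, any object providing `load`/`save` conforms structurally, with no inheritance from an abstract base class required.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DataSetProtocol(Protocol):
    """Structural interface: anything with load/save conforms."""

    def load(self) -> Any: ...

    def save(self, data: Any) -> None: ...


class InMemoryDataset:
    """Conforms to DataSetProtocol without subclassing anything."""

    def __init__(self) -> None:
        self._data: Any = None

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data
```

Note that `runtime_checkable` only verifies method *presence* at `isinstance` time, not signatures, so static type checking would still do most of the enforcement.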
I proposed it in another issue, and @antonymilne made some good remarks: #2409 (comment). I could be wrong here, and in any case I defer the implementation details to the engineering team, but I don't think this can be done without "breaking" the current catalog + `AbstractDataset` model. This would essentially be a "catalog 2.0", which is exciting but also kind of scary.
A user confused about what dataset to use for PyTorch models.
On that last part: we're well overdue a native PyTorch dataset.
Another thing to note: some of our underlying datasets have their own mechanisms for reading remote data that don't rely on fsspec (pola-rs/polars#13044), and sometimes our "copy paste" fsspec logic is really obscure and easy to get wrong (kedro-org/kedro-plugins#360 (comment)). Making our "fsspec adapter" (not a mixin, please) more explicit will make life easier for custom dataset authors, and probably for us too.
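A minimal sketch of what an explicit adapter (rather than a mixin or copy-pasted glue) could look like. Everything here is hypothetical: a real version would resolve the protocol and delegate to fsspec, while this stand-in only touches the local filesystem so the example stays self-contained.

```python
from pathlib import Path
from typing import Optional


class FileAdapter:
    """Explicit adapter: all filesystem concerns live in one object.

    A real implementation would delegate to fsspec (protocol parsing,
    credentials, globbing); this local-only stand-in shows the shape.
    """

    def read_bytes(self, path: str) -> bytes:
        return Path(path).read_bytes()

    def write_bytes(self, path: str, data: bytes) -> None:
        Path(path).write_bytes(data)


class TextDataset:
    """A dataset that *has* an adapter instead of inheriting fsspec glue."""

    def __init__(self, filepath: str, fs: Optional[FileAdapter] = None):
        self._filepath = filepath
        self._fs = fs or FileAdapter()

    def load(self) -> str:
        return self._fs.read_bytes(self._filepath).decode("utf-8")

    def save(self, data: str) -> None:
        self._fs.write_bytes(self._filepath, data.encode("utf-8"))
```

Because the adapter is an injected dependency, a dataset backed by a library with its own remote-IO mechanism (like Polars) can simply skip it.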
Progress towards making this easier:

Originally posted by @deepyaman in #3920 (comment)
Also, interesting perspective on object stores vs filesystems: https://docs.rs/object_store/latest/object_store/#why-not-a-filesystem-interface
Originally shared by @MatthiasRoels in kedro-org/kedro-plugins#625 (comment) |
Description
IO read/write functions typically follow this signature:

Creating custom datasets should ideally be as easy as supplying `load`/`save` function(s) that follow this function signature.

Context

`kedro` supports an extensive range of datasets, but it is not exhaustive. Popular libraries used by relatively niche communities like `xarray` and `arviz` aren't currently supported. Beyond these examples, unofficially adding support for more obscure datasets would be easier.
Initially, I was looking to implement something like this and asked in the Discord chat if this pattern made sense.
Then, @datajoely suggested I open a feature request.
Possible Implementation

We can consider a `Dataset` class factory, maybe `GenericDataset` with a class method `.create_custom_dataset(name: str, load: callable, save: callable)`.

Usage would look something like `xarrayNetCDF = GenericDataset("xarrayNetCDF", xarray.open_dataset, lambda x, path, **kwargs: x.to_netcdf(path, **kwargs))`.

Entries can be added to the data catalog YAML just as with any other custom dataset implementation.
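A sketch of what such a factory could look like (a hypothetical implementation, not Kedro code; JSON callables stand in for `xarray.open_dataset` and `to_netcdf`):

```python
import json
from typing import Any, Callable, Optional


class GenericDataset:
    """Hypothetical factory: a dataset defined entirely by two callables.

    Expected shapes:
        load_fn(filepath, **kwargs) -> data
        save_fn(data, filepath, **kwargs) -> None
    """

    def __init__(
        self,
        name: str,
        load_fn: Callable[..., Any],
        save_fn: Callable[..., None],
        filepath: str = "",
        load_args: Optional[dict] = None,
        save_args: Optional[dict] = None,
    ):
        self.name = name
        self._load_fn = load_fn
        self._save_fn = save_fn
        self._filepath = filepath
        self._load_args = load_args or {}
        self._save_args = save_args or {}

    def load(self) -> Any:
        return self._load_fn(self._filepath, **self._load_args)

    def save(self, data: Any) -> None:
        self._save_fn(data, self._filepath, **self._save_args)


# Stand-in callables; a real user would pass e.g. xarray.open_dataset
# and lambda x, path, **kwargs: x.to_netcdf(path, **kwargs).
def json_load(path: str, **kwargs: Any) -> Any:
    with open(path) as fh:
        return json.load(fh, **kwargs)


def json_save(data: Any, path: str, **kwargs: Any) -> None:
    with open(path, "w") as fh:
        json.dump(data, fh, **kwargs)
```

With real libraries this would be instantiated exactly as in the usage example above; the `load_args`/`save_args` dicts are what would let the catalog YAML parameterize the callables.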
Possible Alternatives

- `LambdaDataset` is very similar, but the `load` and `save` are hard-coded in the implementation and cannot be parameterized in the data catalog, as far as I'm aware.
- `AbstractDataset` is an option, but this feature request seeks to reduce boilerplate when defining new datasets.