Passing Extra Parameters to Custom Dataset #1723
Comments
@brendalf On your specific error, can you try returning a dictionary from your function and constructing the
Since Kedro tries to abstract data saving/loading from logic, I don't think this is directly supported. Off the top of my head, what you could do is return these runtime values from nodes, either explicitly or using hooks to pass that extra output.
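For illustration, a rough sketch of the hook route (all names hypothetical; it reaches into a private attribute, so treat it as a workaround rather than a supported API):

```python
from kedro.framework.hooks import hook_impl


class RuntimeSaveArgsHook:
    """Sketch: inject runtime values into a dataset's save args via hooks."""

    def __init__(self):
        self._catalog = None

    @hook_impl
    def after_catalog_created(self, catalog):
        # Keep a handle on the catalog so the dataset can be reached later.
        self._catalog = catalog

    @hook_impl
    def before_dataset_saved(self, dataset_name, data):
        if dataset_name == "my_delta_table":  # hypothetical dataset name
            dataset = self._catalog._get_dataset(dataset_name)
            # Reaches into a private attribute: a workaround, not a supported
            # API. "replace_where" is the custom dataset's own save arg.
            dataset._save_args["replace_where"] = "date >= '2022-07-01'"
```

The hook would then be registered via the `HOOKS` tuple in the project's `settings.py`.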
Hi @deepyaman.
Can you provide a short example?
Are you running this with ParallelRunner? That's a common issue here.
No, I'm not.
Although I solved the issue by wrapping the data and the parameters inside a class, I think it would be good to have this feature handled by Kedro in the future.
Hi @brendalf I've just realised this is possibly resolved by tweaking the copy_mode of MemoryDataset: https://kedro.readthedocs.io/en/latest/_modules/kedro/io/memory_dataset.html
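For reference, the tweak usually amounts to declaring the intermediate dataset explicitly in the catalog with `copy_mode: assign`, so Kedro hands the Spark object through by reference instead of deep-copying (pickling) it. The dataset name below is illustrative, and the class is spelled `MemoryDataSet` on older Kedro versions:

```yaml
spark_plan:
  type: MemoryDataset
  copy_mode: assign
```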
These errors almost always come from serialization; I think we had a similar issue with
@noklam do you think we could catch this pickling error and recommend the solution? It's a hard one to debug for users in this situation.
Hi @datajoely |
Sorry - so you can do this by explicitly declaring the MemoryDataset in the catalog. I also think it would work if you were to subclass our MemoryDataset.
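A sketch of what the subclassing route could look like (the class name is made up; it just pins the copy mode):

```python
from kedro.io import MemoryDataset  # spelled MemoryDataSet on older Kedro


class SparkMemoryDataset(MemoryDataset):
    """A memory dataset that hands Spark objects through by reference."""

    def __init__(self, *args, **kwargs):
        # Force copy_mode="assign" so Kedro never tries to deepcopy/pickle
        # the lazily-evaluated Spark object between nodes.
        kwargs["copy_mode"] = "assign"
        super().__init__(*args, **kwargs)
```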
Do you think it would be nice to have, in the future, a way to send runtime-calculated values as extra parameters to the dataset?
@brendalf Could you provide an example of that?
@brendalf or perhaps: why can't you just return runtime data as inputs to another node? Does it need to be in the DataSet implementation?
My custom dataset needs to receive two things: the Spark DataFrame itself and the replace-where values calculated at runtime.
Example: I thought about three solutions to solve this:
I actually solved the problem with the first approach, but it's problematic: now, when I want to join nodes together, downstream nodes no longer receive the Spark dataset plan with lazy evaluation, but an instance of this class. I couldn't find out how to implement the second approach. The problem with the third one is that I want to keep using the data catalog.
Hello folks, any news here?
I think this option is most common amongst the community:
In Kedro the nodes should be pure Python functions with no knowledge of IO, so you should never have a context available there.
The question of dynamic datasets like these has come up recently in some user conversations. We haven't started thinking about how to do it yet.
Description
Hello there.
I created a custom dataset to handle our Spark Delta Tables.
The problem is that the custom dataset needs a replace-where string defining which partition should be overwritten after the data is generated inside the node.
Catalog definition:
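A minimal sketch of what such an entry could look like, assuming a hypothetical `DeltaTableDataset` class referenced by its full import path (all names, paths, and arguments are illustrative):

```yaml
my_delta_table:
  type: my_project.datasets.DeltaTableDataset  # hypothetical custom class
  filepath: s3://bucket/path/to/table
  save_args:
    mode: overwrite
    replace_where: "date >= '2022-07-01'"  # static here; in practice this
                                           # value is only known at runtime
```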
I can't use the parameters inside the `save_args` key for the custom dataset, because the replace values are also calculated during execution, depending on other pipeline parameters like DATE_START and LOOKBACK.

I tried to create a class to be the interface between the nodes and the custom dataset. This class holds the Spark DataFrame and extra values, but Kedro fails when trying to convert it to a pickle:
Node return:
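A sketch of the wrapper idea, assuming a `SparkPlan` class holding the DataFrame plus the runtime values (the field names and node function are guesses, not the original code):

```python
from dataclasses import dataclass

from pyspark.sql import DataFrame


@dataclass
class SparkPlan:
    """Wraps the lazily-evaluated DataFrame together with runtime save values."""

    df: DataFrame
    replace_where: str


def generate_table(source_df: DataFrame, params: dict) -> SparkPlan:
    # Values such as DATE_START and LOOKBACK are only known during execution,
    # so the node computes the replace-where string and ships it with the data.
    replace_where = f"date >= '{params['DATE_START']}'"
    return SparkPlan(df=source_df, replace_where=replace_where)
```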
Custom Dataset save method:
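And a sketch of a save method that unwraps it, reusing the `SparkPlan` dataclass from the sketch above and assuming Delta's `replaceWhere` write option (the class name and bodies are illustrative; the base class is spelled `AbstractDataSet` on older Kedro):

```python
from kedro.io import AbstractDataset
from pyspark.sql import SparkSession


class DeltaTableDataset(AbstractDataset):  # hypothetical custom dataset
    def __init__(self, filepath: str, save_args: dict = None):
        self._filepath = filepath
        self._save_args = save_args or {}

    def _save(self, data: SparkPlan) -> None:
        # Unwrap the DataFrame and the runtime-computed replace-where string.
        (
            data.df.write.format("delta")
            .mode(self._save_args.get("mode", "overwrite"))
            .option("replaceWhere", data.replace_where)
            .save(self._filepath)
        )

    def _load(self) -> SparkPlan:
        # Wrapping on load as well keeps node signatures symmetric, at the
        # cost described above: nodes see the wrapper, not the bare DataFrame.
        spark = SparkSession.builder.getOrCreate()
        df = spark.read.format("delta").load(self._filepath)
        return SparkPlan(df=df, replace_where="")

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "save_args": self._save_args}
```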
Error received:
Questions:
Edit 1 - 2022-07-25:
The error above was happening because I typed the wrong dataset name in the node outputs, so Kedro tried to save the object as a MemoryDataset.
I solved the problem of sending extra parameters by using this `SparkPlan` wrapper around every save and load from my custom dataset.