-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable the usage of generator functions in nodes #2161
Conversation
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes are really cool, I love the efficient use of functions from itertools
and more_itertools
.
I left a few comments, all on code style, I couldn't find any logical problems though I haven't tested these changes manually yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a really interesting PR to review! I learned a lot, but also still have a lot of questions 😁
I was also going to comment we should add docs for this and then saw you added an issue already 👍
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Generators cannot refer to themselves in their definition, or they will fail when used. Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks really good now! Great addition 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I really like this PR. The code is in a good state.
Would you like someone to manually test the changes before merging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Modify the node and run_node to enable generator nodes Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> * Add tests to cover all types of generator functions Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> * Fail on running a generator node with async load/save Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> * Lint my code changes Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> * Add changelog to RELEASE.md Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> * Simplify the usage of spy and clarify with a comment Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> * Improve error messaging * Improve readability slightly in certain places * Correct the expected error message in node run tests Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> * Revert the eager evaluation of the result in _from_dict Generators cannot refer to themselves in their definition, or they will fail when used. Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> * Add a comment for the eager evaluation Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Description
Until now Kedro nodes were not very suitable for chunkwise processing of data. While nothing prevents a user to create a dataset loading the data as a generator, the only way one could process that data is by aggregating their result in memory and only then saving it.
This mode robs the chunkwise processing of data of its main use, namely minimising the memory footprint of the processing function. The changes in this PR make it possible to use generator functions in a Kedro pipeline and save the results by calling the dataset's save method separately for each chunk. Now the only thing a user needs to do in their function is:
For taking advantage of this functionality, users need to create custom datasets which would operate in create-or-append-only mode, since the saving of the data. chunks happens over several invocations of the
.save
method. Chunk-wise data loading is already supported by existing Pandas datasets through thechunksize=
load parameter, IIRC.Note that the current implementation doesn't work with
is_async
data loading and saving.Development notes
There are two main changes:
_outputs_to_dictionary
innode.py
_run_node_sequential
inrunner.py
In 1., the main job is to convert a single generator to one or more generators, mapped to dataset names:
In order to manipulate the generators without invoking them, a combination of
itertools
andmore_itertools
functions are used.In 2., the outputs are tested for being iterators, i.e. coming from generators. If that's the case, the generators are interleaved in a round-robin fashion and paired with their corresponding names. Example:
The new iterator will take the first element from each generator and pair them with the dataset name, then it will take the second element from each generator and pair them with their dataset name, and so on, until the generators finish. All generators are ensured to finish at the same time, because they originate from the same function generator.
This means that once the user function
yield
s the data, each part will be saved to it's corresponding dataset, and then the function will continue processing the next chunk,yield
the data, save it, then again process,yield
, save, and so on.Checklist
RELEASE.md
file