Provide way to process partition datasets one partition at a time #1413
Comments
@deepyaman could you provide a snippet? This would be much simpler than my proposal and I'd love to include it in the docs.
I'll create/share an example later today.
@datajoely Upon looking, there's already an example at the bottom of https://kedro.readthedocs.io/en/stable/data/kedro_io.html#partitioned-dataset-save. I was going to create a "more realistic/relevant" example (image transpositions as a preprocessing step), but let me know if that's still necessary given the existing example.
@deepyaman I'd love to see one, it's actually coming up multiple times a week on discord :)
Here is an example I made that worked for a node that inputted and outputted a partition:

```python
def preprocess_all_data(
    partitioned_input: Dict[str, Callable[[], Any]],
) -> Dict[str, Callable[[], Any]]:
    return {
        key: (lambda: _preprocess_partion(load_func()))
        for key, load_func in partitioned_input.items()
    }
```

Here you are using a function per partition, so each partition is only loaded and processed when its output is actually saved.
I'm actually closing this as I didn't realise there was a neat way of doing it!
A late follow-up, but I have been trying to implement this and ran into a weird error. Using your example, `load_func()` will always end up loading the last file in my folder. The fix:

```python
def preprocess_all_data(
    partitioned_input: Dict[str, Callable[[], Any]],
) -> Dict[str, Callable[[], Any]]:
    return {
        key: (lambda load_func=load_func: _preprocess_partion(load_func()))
        for key, load_func in partitioned_input.items()
    }
```

The issue is related to Python's late-binding closures: each lambda captures the variable `load_func`, not its value at each iteration, so by the time the lambdas are called they all refer to whatever `load_func` was at the end of the loop. This is why all the lambda functions ended up referring to the same `load_func` and returned the same DataFrame every time I called them. Binding `load_func` as a default argument captures its value at definition time. Hope this helps anyone else who faces this issue. Spent a day debugging this.
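The late-binding behaviour described above can be reproduced in a minimal, standalone snippet with no Kedro involved; this is a sketch of the pitfall, not code from the thread:

```python
# All three lambdas close over the *variable* i, not its value at each
# iteration, so they all see the final value once the loop has finished.
late = [lambda: i for i in range(3)]
print([f() for f in late])  # [2, 2, 2]

# Binding i as a default argument captures the value at definition time,
# which is the same trick applied to load_func above.
bound = [lambda i=i: i for i in range(3)]
print([f() for f in bound])  # [0, 1, 2]
```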
Amazing work @auggie246, we need to get this in the docs ASAP!
For anyone trying to spot the diff above:

```diff
 def preprocess_all_data(
     partitioned_input: Dict[str, Callable[[], Any]],
 ) -> Dict[str, Callable[[], Any]]:
     return {
-        key: (lambda: _preprocess_partion(load_func()))
+        key: (lambda load_func=load_func: _preprocess_partion(load_func()))
         for key, load_func in partitioned_input.items()
     }
```
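Putting the thread's pieces together, here is a self-contained sketch of the corrected lazy pattern. `partitioned_input` mimics what `PartitionedDataSet` passes to a node (a dict of partition id to load function), and `_preprocess_partion` is a hypothetical stand-in for the per-partition transform discussed above:

```python
from typing import Any, Callable, Dict


def _preprocess_partion(data: Any) -> Any:
    # Hypothetical per-partition transform; uppercases strings as a stand-in.
    return [s.upper() for s in data]


def preprocess_all_data(
    partitioned_input: Dict[str, Callable[[], Any]],
) -> Dict[str, Callable[[], Any]]:
    # Default-argument binding captures each load_func by value,
    # avoiding the late-binding bug discussed in the thread.
    return {
        key: (lambda load_func=load_func: _preprocess_partion(load_func()))
        for key, load_func in partitioned_input.items()
    }


# Simulate what PartitionedDataSet hands a node: one loader per partition.
partitioned_input = {
    "part-a": lambda: ["foo", "bar"],
    "part-b": lambda: ["baz"],
}

result = preprocess_all_data(partitioned_input)
# Nothing has been loaded or processed yet; each value is a callable,
# so partitions are only materialised one at a time when saved.
print({k: fn() for k, fn in result.items()})
# {'part-a': ['FOO', 'BAR'], 'part-b': ['BAZ']}
```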
Description
`PartitionedDataSet` today provides a lazy method for loading each partition in a memory-efficient way; however, in order to save, all partitions have to be loaded into memory. This can often cause out-of-memory issues if the sum of all partitions is very large. Currently there isn't an easy way to perform an 'intermediate' save within the node. In the last few weeks, several users trying to process a large amount of image data have raised this limitation independently:
Context
The partitioned dataset is a wonderful tool, but in many cases users want to process one partition at a time - not in bulk. The current save mechanism is memory-constrained.
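For readers less familiar with the setup being discussed, a partitioned dataset is typically declared in the catalog along these lines (a sketch with hypothetical names and paths, following the `PartitionedDataSet` parameters documented by Kedro at the time of this issue):

```yaml
# conf/base/catalog.yml (hypothetical example)
raw_images:
  type: PartitionedDataSet
  path: data/01_raw/images       # one file per partition in this folder
  dataset: pillow.ImageDataSet   # underlying dataset used for each partition
  filename_suffix: ".png"
```

On load, a node receives a `Dict[str, Callable[[], Any]]` mapping partition ids to load functions; on save, it can return a dict of the same shape, which is where the memory concern in this issue arises.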
Possible Implementation
This needs more technical design - but it feels quite tangible. Currently things look like this:
Going forward, we could introduce a special node type built on the core assumption that there are the same number of input partitions as output partitions.
Here, as with the rest of Kedro, the node has no knowledge of how the data gets loaded or saved - but we allow users to define a node that simply processes one partition at a time.
To introduce this I think we'd need to make a few changes to the library:

- Extend `PartitionedDataSet` to be able to easily load and save a single partition. I've got this working on a branch called `feature/partitioned-node` as an experiment, and I think this change is useful as is because it would make hooks more powerful.
- Introduce a new node type; `ParititionedNode` or `NodeIterator` could work as names.
- Change `_call_node_run` to essentially run this in a sort of batch mode.