
Provide way to process partition datasets one partition at a time #1413

Closed
datajoely opened this issue Apr 5, 2022 · 10 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@datajoely
Contributor

Description


PartitionedDataSet today provides a lazy method for loading each partition in a memory-efficient way; however, in order to save, all partitions have to be loaded into memory. This can often cause out-of-memory issues if the sum of all partitions is very large. Currently there isn't an easy way to perform an 'intermediate' save within the node.
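
For illustration, the eager pattern described here looks roughly like this (a sketch; on load, PartitionedDataSet hands the node a dictionary of load callables):

from typing import Any, Callable, Dict

def process_all_eagerly(partitions: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    # Each load() call materializes a partition, and the returned dict holds
    # every partition in memory at once before any of it is saved.
    return {name: load() for name, load in partitions.items()}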

In the last few weeks, several users trying to process large amounts of image data have independently raised this limitation.

Context


The partitioned dataset is a wonderful tool, but in many cases users want to process one partition at a time - not in bulk. The current save mechanism is memory constrained.

Possible Implementation


This needs more technical design, but it feels quite tangible. Currently, things look like this:

[Image: diagram of the current flow, where all partitions are loaded into a single node before saving]

Going forward, we could introduce a special node type built around the core assumption that there are the same number of input partitions as output partitions.

[Image: diagram of the proposed flow, where a special node processes one partition at a time]

Here, as with the rest of Kedro, the node has no knowledge of how the data gets loaded or saved, but we allow users to define a node that simply processes one partition at a time.

To introduce this I think we'd need to make a few changes to the library:

  • We would need to update PartitionedDataSet to be able to easily load and save a single partition. I've got this working on a branch called feature/partitioned-node as an experiment, and I think this change is useful as is because it would make hooks more powerful.
  • We would need to introduce some way of identifying this special type of node, possibly a new subclass; PartitionedNode or NodeIterator could work as names.
  • We would finally need to make changes to _call_node_run to essentially run this in a sort of batch mode (a rough sketch follows below).
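
Purely as illustration of that batch mode, and independent of any actual Kedro internals, the behaviour could look something like this (every name here is hypothetical):

from typing import Any, Callable, List

def run_partitioned_node(
    func: Callable[[Any], Any],
    load_partition: Callable[[str], Any],        # hypothetical single-partition load
    save_partition: Callable[[str, Any], None],  # hypothetical single-partition save
    partition_ids: List[str],
) -> None:
    # One input partition maps to one output partition, per the core assumption.
    for pid in partition_ids:
        data = load_partition(pid)       # materialize exactly one partition
        save_partition(pid, func(data))  # persist it before loading the next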
datajoely added the Issue: Feature Request label on Apr 5, 2022
@deepyaman
Member

deepyaman commented Apr 7, 2022

however, in order to save, all partitions have to be loaded into memory

PartitionedDataSet has supported lazily materializing data on save since 0.17.4. Return Callables as values in the dictionary returned by the node (to be saved by PartitionedDataSet) in order to take advantage of this functionality.
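
A minimal sketch of that pattern (transform is a placeholder for any per-partition logic):

from typing import Any, Callable, Dict

def transform(data: Any) -> Any:
    # Placeholder per-partition transform; substitute your own logic.
    return data

def process_partitions(
    partitions: Dict[str, Callable[[], Any]]
) -> Dict[str, Callable[[], Any]]:
    # Returning callables defers each partition's load and transform until
    # PartitionedDataSet saves that partition, so partitions are materialized
    # one at a time. The helper binds `load` per iteration, which also avoids
    # Python's late-binding closure pitfall.
    def deferred(load: Callable[[], Any]) -> Callable[[], Any]:
        return lambda: transform(load())

    return {name: deferred(load) for name, load in partitions.items()}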

@datajoely
Contributor Author

@deepyaman could you provide a snippet? This would be much simpler than my proposal and I'd love to include it in the docs

@deepyaman
Member

@deepyaman could you provide a snippet? This would be much simpler than my proposal and I'd love to include it in the docs

I'll create/share an example later today.

@deepyaman
Member

@datajoely Upon looking, there's already an example at the bottom of https://kedro.readthedocs.io/en/stable/data/kedro_io.html#partitioned-dataset-save. I was going to create a "more realistic/relevant" example (image transpositions as a preprocessing step), but let me know if that's still necessary now that we've found there's already an example.
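
For reference, such an image-transposition example might look roughly like this (a sketch using the same lazy-Callable save pattern; names are placeholders):

from typing import Callable, Dict
from PIL import Image

def transpose_images(
    partitions: Dict[str, Callable[[], Image.Image]]
) -> Dict[str, Callable[[], Image.Image]]:
    def deferred(load: Callable[[], Image.Image]) -> Callable[[], Image.Image]:
        # Each image is loaded and transposed only when its partition is saved.
        return lambda: load().transpose(Image.TRANSPOSE)

    return {name: deferred(load) for name, load in partitions.items()}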

@datajoely
Contributor Author

@deepyaman I'd love to see one; it's actually coming up multiple times a week on Discord :)

@kadeshoe5

Here is an example I made that worked for a node that takes a partitioned input and returns a partitioned output:

def preprocess_all_data(
    partitioned_input: Dict[str, Callable[[], Any]]
) -> Dict[str, Callable[[], Any]]:
    return {
        key: (lambda: _preprocess_partition(load_func()))
        for key, load_func in partitioned_input.items()
    }

Here _preprocess_partition is a function that takes a single data frame as input, and the comprehension applies it lazily to each partition.

@datajoely
Contributor Author

I'm actually closing this as I didn't realise there was a neat way of doing it!

@auggie246

auggie246 commented Feb 28, 2024

Here is an example I made that worked for a node that takes a partitioned input and returns a partitioned output:

def preprocess_all_data(
    partitioned_input: Dict[str, Callable[[], Any]]
) -> Dict[str, Callable[[], Any]]:
    return {
        key: (lambda: _preprocess_partition(load_func()))
        for key, load_func in partitioned_input.items()
    }

Here _preprocess_partition is a function that takes a single data frame as input, and the comprehension applies it lazily to each partition.

A late follow-up, but I have been trying to implement this and ran into a weird error: using your example, load_func() always ends up loading the last file in my folder.
I did manage to fix it by modifying your example.

def preprocess_all_data(
    partitioned_input: Dict[str, Callable[[], Any]]
) -> Dict[str, Callable[[], Any]]:
    return {
        key: (lambda load_func=load_func: _preprocess_partition(load_func()))
        for key, load_func in partitioned_input.items()
    }

The issue is related to Python's late-binding closures: a lambda captures the variable load_func itself, not its value, so every lambda uses load_func as it exists at the end of the loop rather than its value at each iteration. This is why all the lambda functions ended up referring to the same load_func and hence returned the same DataFrame every time I called them.
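
A minimal, Kedro-free demonstration of the pitfall and the default-argument fix:

# Every lambda closes over the loop variable k itself, so after the loop
# they all see its final value:
late = {k: (lambda: k) for k in ("a", "b", "c")}
print([f() for f in late.values()])   # ['c', 'c', 'c']

# Binding k as a default argument captures its value at each iteration:
bound = {k: (lambda k=k: k) for k in ("a", "b", "c")}
print([f() for f in bound.values()])  # ['a', 'b', 'c']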

Hope this helps anyone else who faces this issue. I spent a day debugging it.

@datajoely
Contributor Author

Amazing work @auggie246, we need to get this in the docs ASAP!

@datajoely
Contributor Author

For anyone trying to spot the diff above:

def preprocess_all_data(
    partitioned_input: Dict[str, Callable[[], Any]],
) -> Dict[str, Callable[[], Any]]:
    return {
-        key: (lambda: _preprocess_partition(load_func()))
+        key: (lambda load_func=load_func: _preprocess_partition(load_func()))
        for key, load_func in partitioned_input.items()
    }
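
Putting it together, a self-contained version of the fixed function (imports added; _preprocess_partition stands in for any per-partition transform):

from typing import Any, Callable, Dict

def _preprocess_partition(df: Any) -> Any:
    # Placeholder for whatever per-partition transformation you need.
    return df

def preprocess_all_data(
    partitioned_input: Dict[str, Callable[[], Any]],
) -> Dict[str, Callable[[], Any]]:
    # The default argument binds each load_func at definition time, so each
    # returned callable lazily loads and transforms its own partition when
    # PartitionedDataSet saves it.
    return {
        key: (lambda load_func=load_func: _preprocess_partition(load_func()))
        for key, load_func in partitioned_input.items()
    }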
