Incremental PartitionedDataset saves #499

crypdick · 2020-09-03T18:29:23Z

Description

A PartitionedDataset requires the node to return a full dictionary of (partition name, data) pairs, which are then saved all at once after node execution. This is frustrating when the partitions are large, or when a long-running task fails.

Context

I am creating a deep ensemble by running inference using many models. If I get a runtime error, I lose all the cached inference results from the already-run models. This happened to me 2 days into an inference job, because my cluster's ssh connection timed out.
I am doing an ablation study for this ensemble. The number of partitions in one of my PartitionedDatasets increases exponentially with the maximum allowable ensemble size. So, I am forced to run this pipeline on a memory-optimized EC2 instance when it could otherwise run on my laptop.

Possible Implementation

Allow nodes writing to a PartitionedDataset to yield results one at a time, e.g.

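from typing import Dict, Iterator
import pandas as pd

def partition_dataset_writer() -> Iterator[Dict[str, pd.DataFrame]]:
    for _ in range(10):
        # Each yielded dict holds a single (partition name, data) pair.
        part = {"part_name": pd.DataFrame(...)}
        yield part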
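
To make the proposal concrete, here is a minimal sketch of what a runner could do with such a generator. It is not Kedro's API: `save_incrementally` and `partition_writer` are hypothetical names, and partitions are written as CSV rows with the standard library rather than as DataFrames. The point is only that each partition reaches disk before the next one is computed.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, Iterator, List

Rows = List[List[str]]


def partition_writer(n_parts: int, fail_at: int = -1) -> Iterator[Dict[str, Rows]]:
    """Yield one {partition name: rows} dict at a time."""
    for i in range(n_parts):
        if i == fail_at:
            raise RuntimeError(f"simulated failure at partition {i}")
        yield {f"part_{i}": [["model", "score"], [f"model_{i}", str(i)]]}


def save_incrementally(parts: Iterator[Dict[str, Rows]], directory: Path) -> List[str]:
    """Write every yielded partition to disk right away (hypothetical helper)."""
    saved = []
    for chunk in parts:
        for name, rows in chunk.items():
            with open(directory / f"{name}.csv", "w", newline="") as fh:
                csv.writer(fh).writerows(rows)
            saved.append(name)
    return saved


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        try:
            save_incrementally(partition_writer(5, fail_at=3), out)
        except RuntimeError as err:
            print(err)
        # Partitions produced before the failure are still on disk.
        print(sorted(p.name for p in out.iterdir()))
```

In this run the simulated failure happens at the fourth partition, and the first three files remain on disk, which is the behavior I am asking for.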

takeru · 2020-10-03T21:55:28Z

I made a patch that delays generating the outputs until they are saved.

master...takeru:feature/partitioned_delayed_save

It is working well in my project.

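from typing import Any, Dict, List
import pandas as pd

def make_features(input: pd.DataFrame, years: List[int]) -> Dict[str, Any]:
    parts = {}
    for year in years:
        part_key = f"features-year{year}"
        print(f"part_key: {part_key} {_mem_info()}")
        if True:  # toggle: True = delayed (lazy) save, False = eager save
            # Bind the loop variables as default arguments so that each
            # closure keeps its own year and part_key.
            def f(input_=input, year_=year, part_key_=part_key):
                print(f"(in closure) before part_key: {part_key_} {_mem_info()}")
                output = _make_features(input_, year_)
                print(f"(in closure) after  part_key: {part_key_} {_mem_info()}")
                return output
            features = f  # store the callable; it runs only when the partition is saved
        else:
            features = _make_features(input, year)  # computed immediately
        parts[part_key] = features
    return parts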

Please review it.
If it is OK, I will write tests, add short docs, and send a PR.
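
For reference, the saving side of this idea can be sketched without Kedro. The following is only an illustration under stated assumptions, not the code in the branch above: `save_partitions` and `make_parts` are hypothetical names, and partitions are plain text rather than DataFrames. Values that are callables are invoked at the moment they are written, so only one partition needs to be in memory at a time.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Union

PartitionValue = Union[str, Callable[[], str]]


def save_partitions(parts: Dict[str, PartitionValue], directory: Path) -> List[str]:
    """Write each partition to `directory`, calling lazy values just in time."""
    written = []
    for name, value in parts.items():
        # A callable is treated as a deferred partition: build it now and
        # write it before moving on to the next one.
        data = value() if callable(value) else value
        (directory / f"{name}.txt").write_text(data)
        written.append(name)
    return written


def make_parts(years: List[int]) -> Dict[str, PartitionValue]:
    """Return one deferred partition per year."""
    parts: Dict[str, PartitionValue] = {}
    for year in years:
        # Bind `year` as a default argument; a plain closure would see only
        # the final loop value because Python closures bind late.
        parts[f"features-year{year}"] = lambda year_=year: f"features for {year_}"
    return parts


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        print(save_partitions(make_parts([2019, 2020]), out))
        print((out / "features-year2019.txt").read_text())
```

The default-argument binding is the same trick used in `make_features`: without it, every deferred partition would be computed for the last year in the loop.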

crypdick · 2020-10-04T17:20:52Z

Thanks @takeru. I haven't tested your branch, but I think it will cause issues with after_node_run hooks which expect the partitions to not be callables.

takeru · 2020-10-05T00:43:51Z

Thank you,

I checked specs:
https://kedro.readthedocs.io/en/stable/kedro.framework.hooks.specs.NodeSpecs.html#kedro.framework.hooks.specs.NodeSpecs.after_node_run

The output type is Any, so no hook can support every possible output type. I think users should take care of the combination of output type and DataSet class in their own design.

PartitionedDataSet is already a special case: what it loads is also a dictionary of callables. So we already need to take care when combining PartitionedDataSet with before_xxx hooks.

Anyway, putting everything in memory and then writing it out to disk is not a good way to do it.
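
As a rough illustration of a hook taking care of this itself, the sketch below checks for callables before touching the data. It is plain Python, not a Kedro hook implementation: `describe_outputs` is a hypothetical function standing in for the body of an after_node_run hook, and it assumes the outputs arrive as a dictionary keyed by dataset name.

```python
from typing import Any, Dict


def describe_outputs(outputs: Dict[str, Any]) -> Dict[str, str]:
    """Summarize node outputs without forcing lazy partitions to materialize."""
    summary = {}
    for name, value in outputs.items():
        if isinstance(value, dict):
            # Partitioned output: count deferred (callable) partitions
            # instead of calling them.
            lazy = sum(1 for part in value.values() if callable(part))
            summary[name] = f"{len(value)} partitions ({lazy} lazy)"
        else:
            summary[name] = type(value).__name__
    return summary


if __name__ == "__main__":
    outputs = {
        "features": {"year2019": lambda: [1, 2, 3], "year2020": [4, 5, 6]},
        "row_count": 6,
    }
    print(describe_outputs(outputs))
```

A hook written this way works with both eager and delayed partitions, at the cost of not seeing the delayed data itself.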

lou-k · 2020-10-27T14:26:22Z

I have a similar need; lazy writes would be very helpful for this or the incremental data set.

turn1a · 2020-12-22T18:06:35Z

I have a similar need; it would be great to have an incremental save option.

elephantum · 2020-12-23T18:54:26Z

Same here, this design limitation is very frustrating.

Imagine preprocessing 10K images.

yurigba · 2021-02-17T22:18:20Z

This feature is much needed for geospatial data, and I look forward to having something to use in this context. I'd suggest some kind of lazy dataset that does nothing when the whole dataset is saved, and instead saves when you call the save method on the loaded LazyDataSet object. Then you could use transcoding to convert it to a PartitionedDataSet and open partitions as needed...

lou-k · 2021-04-02T18:08:44Z

I opened #744

merelcht · 2021-05-24T13:36:02Z

Closing this issue as #744 has now been merged.

crypdick added the "Issue: Feature Request" label Sep 3, 2020

lou-k mentioned this issue Apr 2, 2021

Let PartitionedDataSet lazily materialize data to save #744

Merged

6 tasks

merelcht closed this as completed May 24, 2021
