Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental PartitionedDataset saves #499

Closed
crypdick opened this issue Sep 3, 2020 · 9 comments
Closed

Incremental PartitionedDataset saves #499

crypdick opened this issue Sep 3, 2020 · 9 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@crypdick
Copy link

crypdick commented Sep 3, 2020

Description

PartitionDatasets require returning a full dictionary of (partition name, data) pairs, which then get saved all at once after node execution. This is frustrating when you have partitions that are large, or if you have a long-running tasks that fails.

Context

  1. I am creating a deep ensemble by running inference using many models. If I get a runtime error, I lose all the cached inference results from the already-run models. This happened to me 2 days into an inference job, because my cluster's ssh connection timed out.
  2. I am doing an ablation study for this ensemble. The number of partitions in one of my PartitionedDataset increase exponentially with the maximum allowable ensemble size. So, I am forced to run this pipeline on a memory-optimized EC2 instance when it could otherwise run on my laptop.

Possible Implementation

Allow nodes writing to a PartitionedDataset to yield results one at a time, e.g.

def partition_dataset_writer() -> Dict[str, pd.DataFrame]:
    for _ in range(10):
       part = {"part_name": pd.DataFrame(...)}
       yield part
@crypdick crypdick added the Issue: Feature Request New feature or improvement to existing feature label Sep 3, 2020
@takeru
Copy link
Contributor

takeru commented Oct 3, 2020

I make patch to delay to generate outputs.

master...takeru:feature/partitioned_delayed_save

It is working good in my project.

def make_features(input: pd.DataFrame, years: List[int]) -> Dict[str, Any]:
    parts = {}
    for year in years:
        part_key = f"features-year{year}"
        print(f"part_key: {part_key} {_mem_info()}")
        if True:
            def f(input_=input, year_=year, part_key_=part_key):
                print(f"(in closure) before part_key: {part_key_} {_mem_info()}")
                output = _make_features(input_, year_)
                print(f"(in closure) after  part_key: {part_key_} {_mem_info()}")
                return output
            features = f
        else:
            features = _make_features(input, year)
        parts[part_key] = features
    return parts

Please review about it.
If it is OK. I will make tests, write short docs, and send PR.

@crypdick
Copy link
Author

crypdick commented Oct 4, 2020

Thanks @takeru. I haven't tested your branch, but I think it will cause issues with after_node_run hooks which expect the partitions to not be callables.

@takeru
Copy link
Contributor

takeru commented Oct 5, 2020

Thank you,

I checked specs:
https://kedro.readthedocs.io/en/stable/kedro.framework.hooks.specs.NodeSpecs.html#kedro.framework.hooks.specs.NodeSpecs.after_node_run

Output type is Any, then all hook can't support all output types. I think, user should take care of combination of output type and Dataset class by design.

PartitionDatasets has special functions. Input of that is special, is callable too. We also need to take care of PartitionDatasets and before_xxx_hook.

Anyway, putting everything in memory and then writing it out to disk is not a good way to do it.

@lou-k
Copy link
Contributor

lou-k commented Oct 27, 2020

I have a similar need; lazy writes would be very helpful for this or the incremental data set.

@t00rgore
Copy link

I am facing a similar need, would be great to have an incremental save option.

@elephantum
Copy link

Same here, this design limitation is very frustrating.

Imaging preprocessing 10K images.

@yurigba
Copy link

yurigba commented Feb 17, 2021

This feature is much needed for geospatial data. I look forward to have something to use in this context. I'd suggest some kind of lazy dataset that does nothing when saving the whole dataset, but rather saves when you call the method $save$ in the loaded LazyDataSet object somehow. Then you use transcoding and convert it to a PartitionedDataSet and you can open as needed...

@lou-k
Copy link
Contributor

lou-k commented Apr 2, 2021

I opened #744

@merelcht
Copy link
Member

Closing this issue as #744 has now been merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
None yet
Development

No branches or pull requests

7 participants