
Incremental PartitionedDataset saves #499

crypdick opened this issue Sep 3, 2020 · 9 comments

Issue: Feature Request New feature or improvement to existing feature



crypdick commented Sep 3, 2020


PartitionedDatasets require returning a full dictionary of (partition name, data) pairs, which then get saved all at once after node execution. This is frustrating when you have large partitions, or a long-running task that fails.


  1. I am creating a deep ensemble by running inference using many models. If I get a runtime error, I lose all the cached inference results from the already-run models. This happened to me two days into an inference job, because my cluster's SSH connection timed out.
  2. I am doing an ablation study for this ensemble. The number of partitions in one of my PartitionedDatasets increases exponentially with the maximum allowable ensemble size. So, I am forced to run this pipeline on a memory-optimized EC2 instance when it could otherwise run on my laptop.

Possible Implementation

Allow nodes writing to a PartitionedDataset to yield results one at a time, e.g.

def partition_dataset_writer() -> Iterator[Dict[str, pd.DataFrame]]:
    for i in range(10):
        part = {f"part_{i}": pd.DataFrame(...)}
        yield part
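To illustrate why yielding helps: a runner could consume such a generator and persist each partition as soon as it is yielded, so a mid-run failure loses only the partition in flight, and at most one partition is held in memory at a time. A minimal sketch, assuming a hypothetical `save_partition(name, data)` callback — this is not kedro's actual API:

```python
from typing import Callable, Dict, Iterator


def save_incrementally(
    writer: Iterator[Dict[str, object]],
    save_partition: Callable[[str, object], None],
) -> int:
    """Consume a partition-yielding generator, saving each part immediately.

    Returns the number of partitions saved, so anything persisted before a
    failure partway through the generator is not lost.
    """
    saved = 0
    for part in writer:
        for name, data in part.items():
            save_partition(name, data)  # persist before pulling the next part
            saved += 1
    return saved
```

For example, with a plain dict standing in for disk storage, `save_incrementally(iter([{"a": 1}, {"b": 2}]), store.__setitem__)` saves both partitions and returns 2.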
crypdick added the Issue: Feature Request label on Sep 3, 2020

takeru commented Oct 3, 2020

I made a patch that delays generating the outputs.

It is working well in my project.

def make_features(input: pd.DataFrame, years: List[int]) -> Dict[str, Any]:
    parts = {}
    for year in years:
        part_key = f"features-year{year}"
        print(f"part_key: {part_key} {_mem_info()}")

        # Bind the loop variables as default arguments so each closure
        # captures its own year/key instead of the last loop iteration's.
        def f(input_=input, year_=year, part_key_=part_key):
            print(f"(in closure) before part_key: {part_key_} {_mem_info()}")
            output = _make_features(input_, year_)
            print(f"(in closure) after  part_key: {part_key_} {_mem_info()}")
            return output

        # Store the callable itself; it is invoked lazily at save time.
        parts[part_key] = f
    return parts

Please review it.
If it looks OK, I will add tests, write short docs, and send a PR.
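For anyone reading along: the point of binding `year_=year` as a default argument in the patch above is that each stored callable keeps its own loop value, and nothing is computed until the callable is invoked. A standalone sketch of the pattern (with a stub computation in place of `_make_features` and DataFrames) showing the late-binding pitfall the defaults avoid:

```python
def build_parts(years):
    """Return a dict of zero-argument callables, one per year."""
    parts = {}
    for year in years:
        # Without the default argument, every closure would see the *last*
        # value of `year` when finally called (Python's late binding).
        def make(year_=year):
            return year_ * 10  # stand-in for _make_features(input, year_)
        parts[f"features-year{year}"] = make
    return parts


parts = build_parts([2018, 2019])
# Nothing is computed until each callable is invoked:
results = {key: fn() for key, fn in parts.items()}
```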


crypdick commented Oct 4, 2020

Thanks @takeru. I haven't tested your branch, but I think it will cause issues with after_node_run hooks, which expect the partitions not to be callables.


takeru commented Oct 5, 2020

Thank you.

I checked the specs:

The output type is Any, so hooks can't support every output type anyway. I think users should take care of the combination of output type and dataset class by design.

PartitionedDataSet already has special behavior: its inputs are callables too, so we also need to consider how PartitionedDataSet interacts with the before_xxx hooks.

In any case, putting everything in memory and then writing it all out to disk is not a good approach.


lou-k commented Oct 27, 2020

I have a similar need; lazy writes would be very helpful for this or the incremental data set.


I am facing a similar need; it would be great to have an incremental save option.


Same here; this design limitation is very frustrating.

Imagine preprocessing 10K images.


yurigba commented Feb 17, 2021

This feature is much needed for geospatial data, and I look forward to having something to use in this context. I'd suggest some kind of lazy dataset that does nothing when the whole dataset is saved, but instead writes when you call the `save` method on the loaded LazyDataSet object. Then you could use transcoding to convert it to a PartitionedDataSet and open partitions as needed...
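One way to read this suggestion is a thin wrapper that pairs a value with a deferred write: saving the node output is a no-op, and the real write happens only when the caller invokes `save` on the loaded object. A rough sketch of the idea (not kedro's AbstractDataSet API; `LazySave` and `write_fn` are hypothetical names):

```python
class LazySave:
    """Wraps a value and a write function; the write runs only on demand."""

    def __init__(self, value, write_fn):
        self.value = value
        self._write_fn = write_fn
        self.saved = False

    def save(self):
        # Perform the deferred write exactly once, when the caller asks.
        if not self.saved:
            self._write_fn(self.value)
            self.saved = True
```

For example, `item = LazySave(df, writes.append)` records nothing into `writes` until `item.save()` is called, and repeated calls do not write twice.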


lou-k commented Apr 2, 2021

I opened #744


Closing this issue as #744 has now been merged.
