Incremental PartitionedDataset saves #499
Comments
I made a patch that delays generating the outputs: master...takeru:feature/partitioned_delayed_save. It is working well in my project.

```python
from typing import Any, Dict, List

import pandas as pd

# _mem_info() and _make_features() are helpers defined elsewhere in my project.

def make_features(input: pd.DataFrame, years: List[int]) -> Dict[str, Any]:
    parts = {}
    for year in years:
        part_key = f"features-year{year}"
        print(f"part_key: {part_key} {_mem_info()}")
        if True:  # lazy: store a closure instead of the computed partition
            # Default arguments bind the current loop values, so each closure
            # captures its own input/year/key rather than the last iteration's.
            def f(input_=input, year_=year, part_key_=part_key):
                print(f"(in closure) before part_key: {part_key_} {_mem_info()}")
                output = _make_features(input_, year_)
                print(f"(in closure) after part_key: {part_key_} {_mem_info()}")
                return output
            features = f
        else:  # eager: compute the partition immediately (the current behaviour)
            features = _make_features(input, year)
        parts[part_key] = features
    return parts
```

Please review it.
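For context, here is a minimal sketch of the idea behind such a patch (hypothetical code, not the actual diff in the branch above): at save time the dataset checks whether a partition's value is callable and, if so, invokes it only then, so at most one partition's data is materialised in memory at a time.

```python
from typing import Any, Callable, Dict


def save_partitions(parts: Dict[str, Any], write: Callable[[str, Any], None]) -> None:
    # `write` is assumed to persist a single (partition name, data) pair.
    for part_key, data in sorted(parts.items()):
        if callable(data):
            # Lazy partition: materialise it only now, at write time.
            data = data()
        write(part_key, data)
```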
Thanks @takeru. I haven't tested your branch, but I think it will cause issues with `after_node_run` hooks, which expect the partitions not to be callables.
Thank you. I checked the specs: the output type is `Any`, so hooks cannot support every output type in any case. I think users should take care of the combination of output type and dataset class by design. `PartitionedDataset` already has special behaviour: its input is special too, since it is callable. We would also need to consider how `PartitionedDataset` interacts with the `before_xxx` hooks. In any case, materialising everything in memory and then writing it out to disk is not a good approach.
I have a similar need; lazy writes would be very helpful for this or the incremental dataset.
I am facing a similar need; it would be great to have an incremental save option.
Same here, this design limitation is very frustrating. Imagine preprocessing 10K images.
This feature is much needed for geospatial data, and I look forward to having something to use in this context. I'd suggest some kind of lazy dataset that does nothing when saving the whole dataset, but instead saves when you call the method (see the sketch below).
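A rough sketch of that idea (hypothetical, not an existing Kedro API; the class and method names are made up for illustration): a wrapper whose bulk save is a no-op, and which instead writes each partition the moment the node hands it over.

```python
import os
import pickle
from typing import Any


class LazyPartitionWriter:
    """Hypothetical lazy dataset: there is no bulk save; each partition
    is written to disk as soon as the node calls save_partition()."""

    def __init__(self, path: str) -> None:
        self._path = path
        os.makedirs(path, exist_ok=True)

    def save_partition(self, name: str, data: Any) -> None:
        # Persist one partition immediately; the caller can then drop
        # its reference so the data is garbage-collected.
        with open(os.path.join(self._path, f"{name}.pkl"), "wb") as fh:
            pickle.dump(data, fh)
```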
I opened #744 |
Closing this issue as #744 has now been merged. |
Description
PartitionedDataset requires returning a full dictionary of (partition name, data) pairs, which is then saved all at once after node execution. This is frustrating when the partitions are large, or when a long-running task fails partway through.
Context
Possible Implementation
Allow nodes writing to a PartitionedDataset to yield results one at a time, e.g. as sketched below.
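A hypothetical sketch of such a node (the generator signature and the `_make_features` helper are illustrative, not an existing Kedro API):

```python
from typing import Any, Iterator, List, Tuple

import pandas as pd


def _make_features(data: pd.DataFrame, year: int) -> pd.DataFrame:
    # Placeholder for an expensive per-year feature computation.
    return data[data["year"] == year]


def make_features(data: pd.DataFrame, years: List[int]) -> Iterator[Tuple[str, Any]]:
    # Yield (partition name, partition data) pairs one at a time so the
    # dataset could save each partition as soon as it is produced, instead
    # of holding the whole dictionary in memory until the node returns.
    for year in years:
        yield f"features-year{year}", _make_features(data, year)
```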