PartitionedDataset requires a node to return the full dictionary of (partition name, data) pairs, which is then saved all at once after the node finishes executing. This is frustrating when the partitions are large, or when a long-running task fails partway through.
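A minimal sketch of that current pattern (the `models` argument and return types are illustrative, not from a real pipeline):

```python
from typing import Dict

import pandas as pd


def run_all_models(models: Dict[str, object], data: pd.DataFrame) -> Dict[str, pd.DataFrame]:
    """Current pattern: every partition is built in memory and returned together."""
    results = {}
    for name, model in models.items():
        # every inference result stays in memory until the node returns
        results[name] = pd.DataFrame(model.predict(data))
    return results  # Kedro only saves the partitions after this point
```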
I am creating a deep ensemble by running inference with many models. If I hit a runtime error, I lose all the inference results from the models that have already run. This happened to me 2 days into an inference job, because my cluster's SSH connection timed out.
I am also doing an ablation study for this ensemble. The number of partitions in one of my PartitionedDatasets increases exponentially with the maximum allowable ensemble size, so I am forced to run this pipeline on a memory-optimized EC2 instance when it could otherwise run on my laptop.
Allow nodes writing to a PartitionedDataset to yield results one at a time, e.g.
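A hypothetical sketch of what such a node could look like (the yield-based contract is the proposal here, not current Kedro behaviour; names are illustrative):

```python
from typing import Dict, Iterator, Tuple

import pandas as pd


def run_all_models(models: Dict[str, object], data: pd.DataFrame) -> Iterator[Tuple[str, pd.DataFrame]]:
    for name, model in models.items():
        # each (partition_name, data) pair could be written as soon as it is
        # yielded, so a failure on model N keeps the results of models 1..N-1
        yield name, pd.DataFrame(model.predict(data))
```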
This feature is much needed for geospatial data, and I look forward to having something to use in this context. I'd suggest some kind of lazy dataset that does nothing when saving the whole dataset, and instead saves when you call the `save` method on the loaded `LazyDataSet` object. Then you could use transcoding to convert it to a `PartitionedDataSet` and open partitions as needed. A rough sketch of the idea follows.
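A possible shape for that suggestion (hypothetical: `LazyDataSet` is not an existing Kedro class, and the `_load`/`_save` hooks are simplified here):

```python
import os

import pandas as pd


class LazyDataSet:
    """Hypothetical dataset: saving the dataset as a whole is a no-op;
    items are written only when .save() is called on the loaded handle."""

    def __init__(self, path: str):
        self._path = path

    def _load(self) -> "LazyDataSet":
        return self  # the loaded object is the handle used for incremental saves

    def _save(self, data) -> None:
        pass  # intentionally does nothing when the whole dataset is "saved"

    def save(self, key: str, data: pd.DataFrame) -> None:
        # persist one item immediately, e.g. one inference result per model
        os.makedirs(self._path, exist_ok=True)
        data.to_parquet(os.path.join(self._path, f"{key}.parquet"))
```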