Inline sort and group #473
pavlis
started this conversation in
Design & Development
Replies: 1 comment
-
I don't think it solves this problem because it is more of a dump/restore operator, but I just ran across the existence of a spark RDD method called pickleFile. The minimal documentation can be found here. It is a method of SparkContext. I don't think dask has a comparable thing. I think the main thing it illustrates is the feasibility of that model. |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
This may be a useless idea, but it is something that came to mind while I was writing a new section for the User Manual on parallel IO considerations (a user manual page on the new read_dstributed_data and write_distributed_data implementations).
The background is that dask and spark are known to be terrible at sorting and grouping data. Not only are they slow, but I've seen examples where a large data set will abort with a memory fault if you attempt a groupby/Foldby type operation. The reason is the bag/RDD containers we use do not have what in C++ could be called a random iterator. They are large data versions of a linked list and can only be traversed from front to back. Subscripting is not possible.
With that background note the only solution we have now, which appears in my current memory management section of the user manual, is to use the database to handle this problem. That requires a nontrivial save and read operation, both of which have enough complexity to be slow.
What I wonder is if we could write a scratch file with pickle in map operation loading the file offset for each datum into each datum's Metadata. That function could (internally) return a pandas Dataframe with attributes needed to defines sort and/or grouping. That Dataframe could then be used to do random reads with the scratch file to reassemble the bag/rdd in a different order. For ensembles grouping should be possible by a similar mechanism.
A lot of details to confirm that might actually work, but I wanted to put it out there for comments.
Beta Was this translation helpful? Give feedback.
All reactions