Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.Sign up
[hail] memory-efficient scan #6345
This adds SpillingCollectIterator which avoids holding more than 1000 aggregation results in memory at one time. We could do something that listens for GC events and spills data if there's high memory pressure. That seems a bit error prone and hard.
The number of results kept in memory is a flag on the HailContext. In C++ we can design a system that is aware of its memory usage and adjusts memory allocated to scans accordingly.
I had to add two new file operations to
When we overflow our in-memory buffer, we spill to a disk file. We use O(n_partitions / mem_limit) files. We stream through the files to
I somewhat better solution would be to eagerly scan as results come in. I leave that as future work.
For posterity this is what goes wrong if I don't create a fresh OOS for each object: