[hail] Add table and matrix table indices #6266
This currently is exposed in
This adds an
One index file is written per partition. For matrix tables, the offset
When writing data with a blocked spec (like the default) we use virtual
OK, the benchmarks are pretty bad. These are the range table benchmarks from #6529
For where we are, this seems pretty good. Range table is the worst case because the key consists of all the data. For the key, the index should have roughly 2x cost (stores an extra copy of the key, along with metadata like the file offset). What's more, we're using Java values for the keys and annotations, which is also really bad.
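To make the overhead argument concrete, here is a toy sketch of a per-partition index. It is not Hail's on-disk format: `write_partition`, `index_size_bytes`, `lookup`, the 8-byte row encoding, and the 8-byte offset are all assumptions made for illustration. Like a range table, each row is nothing but its key, so every index entry duplicates the whole row and adds an offset on top.

```python
import bisect
import io
import struct

# Hypothetical row encoding: one little-endian 8-byte integer per row.
# As in a range table, the key is the entire row.
RECORD = struct.Struct("<q")
OFFSET_BYTES = 8  # assumed size of a stored file offset


def write_partition(keys):
    """Write sorted integer keys to a data buffer and collect index entries."""
    data = io.BytesIO()
    index = []  # (key, byte offset of that row within the partition data)
    for key in keys:
        index.append((key, data.tell()))
        data.write(RECORD.pack(key))
    return data.getvalue(), index


def index_size_bytes(index):
    """Each entry stores a copy of the key plus one offset."""
    return len(index) * (RECORD.size + OFFSET_BYTES)


def lookup(data, index, key):
    """Binary-search the index, then read the row at the recorded offset."""
    i = bisect.bisect_left(index, (key, 0))
    if i == len(index) or index[i][0] != key:
        return None
    (value,) = RECORD.unpack_from(data, index[i][1])
    return value


data, index = write_partition(range(1000))
```

In this toy layout the index is never smaller than the data it points into, which is why a key-only table is the worst case for index overhead. A table with wide non-key fields would pay the same per-row index cost against a much larger data file.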
A better approach would be to have each node (internal or leaf) own a region that stores the keys, and write that region out directly instead of going from Java object to region value to serialization.
It is spending its time in