[hail] Add table and matrix table indices #6266
Conversation
very cool change.
filter_intervals=bool)
def __init__(self, path, intervals, filter_intervals):
    if intervals is not None:
        t = hl.expr.impute_type(intervals)
I think this logic should be in read_matrix_table -- we generally assume that parameters on the IR have been validated.
also, shouldn't there be a check that the interval type agrees with the key type?
Can't do that until this is wrapped in a `Table`/`MatrixTable` (Python classes). We have no access to the key type at this point. I can certainly move much of the logic into `read(_matrix)_table`.
Yeah, realized that after I commented and was looking at some of the code in master. I think it would be preferable to do the type lookup in `read_matrix_table` and then pass it into the IR node, but that's definitely out of scope for this change.
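To illustrate the check discussed in this thread, here is a hedged, Hail-free sketch: once the key type is known in `read_matrix_table`, the interval point type could be validated against it before the intervals reach the IR node. Types are modeled here as tuples of type-name strings and the function name is invented; the real implementation would use Hail's type objects.

```python
# Hypothetical sketch (not Hail's actual API): validate that intervals
# agree with the table key before constructing the read IR node.

def validate_interval_point_type(point_type, key_type):
    """Require the interval point type to be a non-empty prefix of the
    table's key type, since the index is ordered by the key."""
    if not point_type or tuple(point_type) != tuple(key_type)[:len(point_type)]:
        raise TypeError(
            f"invalid interval point type {point_type!r} "
            f"for key type {key_type!r}")

# A locus/alleles key would accept intervals over the full key or a prefix:
validate_interval_point_type(('locus',), ('locus', 'alleles'))
validate_interval_point_type(('locus', 'alleles'), ('locus', 'alleles'))
```

The prefix rule is one plausible design choice; the discussion above only establishes that some agreement check belongs in `read_matrix_table` rather than on the IR node.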
I ran the python tests with my latest changes and they all passed. I'll work on addressing comments and adding tests, but this should be ready for another round of code review.
Okay. This is 'done' (pending more review and revisions).
I'll try to give this a once-over this afternoon.
Also need to see benchmarks.
Still need to write benchmarks, but I wrote a new type to hold the options for the indexed reads; please take a look.
Ok, benchmarks are pretty bad. These are the range table benchmarks from #6529, No Index vs. Index:
[benchmark results]
For where we are, this seems pretty good. Range table is the worst case because the key consists of all the data. For the key, the index should have roughly 2x cost (it stores an extra copy of the key, along with metadata like the file offset). What's more, we're using Java values for the keys and annotations, which is also really bad. A better approach would be to have each node (internal or leaf) use a region that stores the keys, and write that out directly instead of going from Java object to region value to serialization.
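To make the 2x-cost argument above concrete, here is a hypothetical sketch (the class and function names are mine, not Hail's) of what a leaf entry in such an index stores, and how a lookup would find the offset to seek to. Each entry carries a copy of the key plus metadata, which is why a range table, whose key is all of its data, roughly doubles in size when indexed.

```python
import bisect
from dataclasses import dataclass, field
from typing import Any

# Hypothetical names, not Hail's actual classes: each leaf entry in the
# index stores a copy of the key, the (virtual) file offset of the
# record, and an optional annotation (e.g. entries_offset for matrix
# tables).

@dataclass
class LeafEntry:
    key: Any
    record_offset: int
    annotation: dict = field(default_factory=dict)

def lower_bound(entries, key):
    """Index of the first entry with entry.key >= key (entries sorted by key)."""
    keys = [e.key for e in entries]
    return bisect.bisect_left(keys, key)

entries = [LeafEntry(10, 0), LeafEntry(20, 512), LeafEntry(30, 1024)]
# A query for key 25 seeks to the first entry at or past it:
assert entries[lower_bound(entries, 25)].record_offset == 1024
```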
Chris and I discussed this last week.
Can we do a bit of profiling to figure out where it's spending the time?
This currently is exposed in `read_table` and `read_matrix_table` as undocumented optional parameters `_intervals` and `_filter_intervals`, which take a list of Python `Interval`s that are used either as a filter or a repartition.

This adds an `IndexedRVDSpec` as the primary container format for indexed data, and increments the file version to 1.1.0.

One index file is written per partition. For matrix tables, the offset to a particular key for the entries is stored in the `entries_offset` field of the annotation that an index may contain. We use the new `IndexSpec` to retrieve the appropriate offset from the index so that we can seek to the proper offset in the partition.

When writing data with a blocked spec (like the default) we use virtual index offsets similar to tabix. The high 48 bits indicate the file offset of the start of a block, and the low 16 bits indicate the offset from the start of the (possibly decompressed) block.
* Only one version
* Only one AbstractRVDSpec `read` method.
* Being indexed is a property of the spec, not the version
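The tabix-style virtual offsets described in this PR can be sketched as follows (the helper names are illustrative, not Hail's): the high 48 bits address the start of a block in the file, and the low 16 bits address a position within the possibly decompressed block.

```python
# Sketch of the virtual-offset packing described above. Helper names
# are invented for illustration.

def make_virtual_offset(block_start: int, within_block: int) -> int:
    # High 48 bits: file offset of the block start.
    # Low 16 bits: offset within the (decompressed) block.
    assert 0 <= block_start < (1 << 48)
    assert 0 <= within_block < (1 << 16)
    return (block_start << 16) | within_block

def split_virtual_offset(voff: int):
    return voff >> 16, voff & 0xFFFF

# Round trip: pack a block offset and an in-block offset, then recover them.
assert split_virtual_offset(make_virtual_offset(12345, 67)) == (12345, 67)
```

One consequence of this encoding, as with tabix/BGZF, is that virtual offsets compare correctly as plain integers: a record later in the file always has a larger virtual offset.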
It is spending its time in
Great, this can also be fixed by not using Java objects. That makes me feel good about things. |
cc @cseed