Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hail] Add table and matrix table indices #6266

Merged
merged 5 commits into from Jul 3, 2019

Conversation

@chrisvittal
Copy link
Collaborator

commented Jun 5, 2019

This currently is exposed in read_table and read_matrix_table as
undocumented optional parameters _intervals and _filter_intervals
which takes a list of python Intervals that are used either as
a filter or a repartition.

This adds an IndexedRVDSpec as the primary container format for
indexed data, and increments the file version to 1.1.0.

One index file is written per partition. For matrix tables, the offset
to a particular key for the entries is stored in the entries_offset
field of the annotation that an index may contain. We use the new
IndexSpec to retrive the appropriate offset from the index so that we
can seek to the proper offset in the partition.

When writing data with a blocked spec (like the default) we use virtual
index offsets similar to tabix. The high 48 bits are used to indicate
the file offset to the start of a block, and the low 16 bits are used to
indicate the offset from the start of the (possibly decompressed) block.

cc @cseed

@tpoterba
Copy link
Collaborator

left a comment

very cool change.

filter_intervals=bool)
def __init__(self, path, intervals, filter_intervals):
if intervals is not None:
t = hl.expr.impute_type(intervals)

This comment has been minimized.

Copy link
@tpoterba

tpoterba Jun 6, 2019

Collaborator

I think this logic should be in read_matrix_table -- we generally assume that parameters on the IR have been validated.

This comment has been minimized.

Copy link
@tpoterba

tpoterba Jun 6, 2019

Collaborator

also, shouldn't there be a check that the interval type agrees with the key type?

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Jun 6, 2019

Author Collaborator

Can't do that until this is wrapped in a Table/MatrixTable (python classes). We have no access to the key type at this point. I can certainly move much of the logic into read(_matrix)_table.

This comment has been minimized.

Copy link
@tpoterba

tpoterba Jun 6, 2019

Collaborator

Yeah, realized that after I commented and was looking at some of the code i master. I think it would be preferable to do the type lookup in read_matrix_table and then pass it into the IR node, but that's definitely out of scope of this change.

hail/python/hail/ir/matrix_reader.py Outdated Show resolved Hide resolved
hail/python/hail/ir/table_reader.py Show resolved Hide resolved
hail/python/hail/methods/impex.py Outdated Show resolved Hide resolved
hail/python/hail/methods/impex.py Outdated Show resolved Hide resolved
hail/src/main/scala/is/hail/HailContext.scala Outdated Show resolved Hide resolved
hail/src/main/scala/is/hail/expr/ir/MatrixValue.scala Outdated Show resolved Hide resolved
hail/src/main/scala/is/hail/table/Table.scala Outdated Show resolved Hide resolved
hail/src/main/scala/is/hail/table/Table.scala Outdated Show resolved Hide resolved
hail/src/main/scala/is/hail/variant/MatrixTable.scala Outdated Show resolved Hide resolved

@chrisvittal chrisvittal force-pushed the chrisvittal:indicies branch from 3eea286 to 66e29bb Jun 6, 2019

@chrisvittal

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 20, 2019

I ran the python tests with my latest changes and they all passed. I'll work on addressing comments and adding tests, but this should be ready for another round of code review.

Addressed most of the comments.

@chrisvittal chrisvittal force-pushed the chrisvittal:indicies branch from e95038b to 9b4ce49 Jun 25, 2019

@chrisvittal chrisvittal changed the title WIP Add Indices [hail] Add table and matrix table indices Jun 25, 2019

@chrisvittal

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 25, 2019

Okay. This is 'done' (pending more review and revisions).

@chrisvittal chrisvittal force-pushed the chrisvittal:indicies branch from 9b4ce49 to 9ed6902 Jun 25, 2019

@tpoterba

This comment has been minimized.

Copy link
Collaborator

commented Jun 26, 2019

I'll try to give this a once-over this afternoon.

@chrisvittal chrisvittal force-pushed the chrisvittal:indicies branch from 2738163 to 9412c0c Jun 28, 2019

@tpoterba

This comment has been minimized.

Copy link
Collaborator

commented Jun 28, 2019

Also need to see benchmarks.

@chrisvittal chrisvittal requested a review from tpoterba Jul 1, 2019

@chrisvittal

This comment has been minimized.

Copy link
Collaborator Author

commented Jul 1, 2019

Still need to write benchmarks, but I wrote a new type to hold the options for the indexed reads, please take a look.

@chrisvittal chrisvittal force-pushed the chrisvittal:indicies branch from 6249db3 to d7b062b Jul 1, 2019

@chrisvittal

This comment has been minimized.

Copy link
Collaborator Author

commented Jul 1, 2019

Ok, benchmarks are pretty bad, these are the range table benchmarks from #6529

No Index

running write_range_table_p1000...
    Mean, Median: 11.79s, 11.18s
running write_range_table_p100...
    Mean, Median: 4.76s, 4.67s
running write_range_table_p10...
    Mean, Median: 3.28s, 3.03s

Index

running write_range_table_p1000...
    Mean, Median: 28.60s, 28.88s
running write_range_table_p100...
    Mean, Median: 10.17s, 10.20s
running write_range_table_p10...
    Mean, Median: 9.52s, 9.44s
@cseed

This comment has been minimized.

Copy link
Collaborator

commented Jul 1, 2019

For where we are, this seems pretty good. Range table is the worst case because the key consists of all the data. For the key, the index should have roughly 2x cost (stores an extra copy of the key, along with metadata like the file offset). What's more, we're using Java values for the keys and annotations, which is also really bad.

A better to do would be have each node (internal or leaf) have a region that stores the keys, and directly write that out instead of using going from Java object to region value to serialization.

@tpoterba

This comment has been minimized.

Copy link
Collaborator

commented Jul 1, 2019

A better to do would be have each node (internal or leaf) have a region that stores the keys, and directly write that out instead of using going from Java object to region value to serialization.

Chris and I discussed this lasat week.

@tpoterba

This comment has been minimized.

Copy link
Collaborator

commented Jul 1, 2019

Can we do a bit of profiling to figure out where it's spending the time?

chrisvittal added some commits Jun 25, 2019

[hail] Add table and matrix table indices
This currently is exposed in `read_table` and `read_matrix_table` as
undocumented optional parameters `_intervals` and `_filter_intervals`
which takes a list of python `Interval`s that are used either as
a filter or a repartition.

This adds an `IndexedRVDSpec` as the primary container format for
indexed data, and increments the file version to 1.1.0.

One index file is written per partition. For matrix tables, the offset
to a particular key for the entries is stored in the `entries_offset`
field of the annotation that an index may contain. We use the new
`IndexSpec` to retrive the appropriate offset from the index so that we
can seek to the proper offset in the partition.

When writing data with a blocked spec (like the default) we use virtual
index offsets similar to tabix. The high 48 bits are used to indicate
the file offset to the start of a block, and the low 16 bits are used to
indicate the offset from the start of the (possibly decompressed) block.
Incorporate verbal feedback
* Only one version
* Only one AbstractRVDSpec `read` method.
* Being indexed is a property of the spec, not the version

@chrisvittal chrisvittal force-pushed the chrisvittal:indicies branch from d7b062b to 1786b78 Jul 2, 2019

@chrisvittal

This comment has been minimized.

Copy link
Collaborator Author

commented Jul 2, 2019

It is spending it's time in IndexWriter.+=. For this particular example, we more than triple the data we're writing, before we were writing an int32, now we're writing an int32 row, along with a B-Tree with an int32 key and an int64 offset field. So honestly this is pretty good. This was always going to make writes slower.

@tpoterba

This comment has been minimized.

Copy link
Collaborator

commented Jul 2, 2019

It is spending it's time in IndexWriter.+=

Great, this can also be fixed by not using Java objects. That makes me feel good about things.

chrisvittal added some commits Jul 3, 2019

Think I got everything

@danking danking merged commit 277ccc7 into hail-is:master Jul 3, 2019

1 check passed

ci-test success
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.