Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hail] Add table and matrix table indices #6266

Merged
merged 5 commits into from Jul 3, 2019
Merged

Conversation

@chrisvittal
Copy link
Collaborator

@chrisvittal chrisvittal commented Jun 5, 2019

This currently is exposed in read_table and read_matrix_table as
undocumented optional parameters _intervals and _filter_intervals
which takes a list of python Intervals that are used either as
a filter or a repartition.

This adds an IndexedRVDSpec as the primary container format for
indexed data, and increments the file version to 1.1.0.

One index file is written per partition. For matrix tables, the offset
to a particular key for the entries is stored in the entries_offset
field of the annotation that an index may contain. We use the new
IndexSpec to retrive the appropriate offset from the index so that we
can seek to the proper offset in the partition.

When writing data with a blocked spec (like the default) we use virtual
index offsets similar to tabix. The high 48 bits are used to indicate
the file offset to the start of a block, and the low 16 bits are used to
indicate the offset from the start of the (possibly decompressed) block.

cc @cseed

Copy link
Collaborator

@tpoterba tpoterba left a comment

very cool change.

Loading

filter_intervals=bool)
def __init__(self, path, intervals, filter_intervals):
if intervals is not None:
t = hl.expr.impute_type(intervals)
Copy link
Collaborator

@tpoterba tpoterba Jun 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this logic should be in read_matrix_table -- we generally assume that parameters on the IR have been validated.

Loading

Copy link
Collaborator

@tpoterba tpoterba Jun 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, shouldn't there be a check that the interval type agrees with the key type?

Loading

Copy link
Collaborator Author

@chrisvittal chrisvittal Jun 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't do that until this is wrapped in a Table/MatrixTable (python classes). We have no access to the key type at this point. I can certainly move much of the logic into read(_matrix)_table.

Loading

Copy link
Collaborator

@tpoterba tpoterba Jun 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, realized that after I commented and was looking at some of the code i master. I think it would be preferable to do the type lookup in read_matrix_table and then pass it into the IR node, but that's definitely out of scope of this change.

Loading

hail/python/hail/ir/matrix_reader.py Outdated Show resolved Hide resolved
Loading
hail/python/hail/ir/table_reader.py Show resolved Hide resolved
Loading
hail/python/hail/methods/impex.py Outdated Show resolved Hide resolved
Loading
hail/python/hail/methods/impex.py Outdated Show resolved Hide resolved
Loading
hail/src/main/scala/is/hail/HailContext.scala Outdated Show resolved Hide resolved
Loading
hail/src/main/scala/is/hail/expr/ir/MatrixValue.scala Outdated Show resolved Hide resolved
Loading
hail/src/main/scala/is/hail/table/Table.scala Outdated Show resolved Hide resolved
Loading
hail/src/main/scala/is/hail/table/Table.scala Outdated Show resolved Hide resolved
Loading
hail/src/main/scala/is/hail/variant/MatrixTable.scala Outdated Show resolved Hide resolved
Loading
@chrisvittal
Copy link
Collaborator Author

@chrisvittal chrisvittal commented Jun 20, 2019

I ran the python tests with my latest changes and they all passed. I'll work on addressing comments and adding tests, but this should be ready for another round of code review.

Loading

@chrisvittal chrisvittal dismissed tpoterba’s stale review Jun 24, 2019

Addressed most of the comments.

@chrisvittal chrisvittal changed the title WIP Add Indices [hail] Add table and matrix table indices Jun 25, 2019
@chrisvittal
Copy link
Collaborator Author

@chrisvittal chrisvittal commented Jun 25, 2019

Okay. This is 'done' (pending more review and revisions).

Loading

@tpoterba
Copy link
Collaborator

@tpoterba tpoterba commented Jun 26, 2019

I'll try to give this a once-over this afternoon.

Loading

hail/src/main/scala/is/hail/expr/ir/TableIR.scala Outdated Show resolved Hide resolved
Loading
hail/src/main/scala/is/hail/expr/ir/TableIR.scala Outdated Show resolved Hide resolved
Loading
hail/src/main/scala/is/hail/HailContext.scala Outdated Show resolved Hide resolved
Loading
hail/src/main/scala/is/hail/expr/ir/MatrixIR.scala Outdated Show resolved Hide resolved
Loading
@tpoterba
Copy link
Collaborator

@tpoterba tpoterba commented Jun 28, 2019

Also need to see benchmarks.

Loading

@chrisvittal chrisvittal requested a review from tpoterba Jul 1, 2019
@chrisvittal
Copy link
Collaborator Author

@chrisvittal chrisvittal commented Jul 1, 2019

Still need to write benchmarks, but I wrote a new type to hold the options for the indexed reads, please take a look.

Loading

@chrisvittal
Copy link
Collaborator Author

@chrisvittal chrisvittal commented Jul 1, 2019

Ok, benchmarks are pretty bad, these are the range table benchmarks from #6529

No Index

running write_range_table_p1000...
    Mean, Median: 11.79s, 11.18s
running write_range_table_p100...
    Mean, Median: 4.76s, 4.67s
running write_range_table_p10...
    Mean, Median: 3.28s, 3.03s

Index

running write_range_table_p1000...
    Mean, Median: 28.60s, 28.88s
running write_range_table_p100...
    Mean, Median: 10.17s, 10.20s
running write_range_table_p10...
    Mean, Median: 9.52s, 9.44s

Loading

@cseed
Copy link
Collaborator

@cseed cseed commented Jul 1, 2019

For where we are, this seems pretty good. Range table is the worst case because the key consists of all the data. For the key, the index should have roughly 2x cost (stores an extra copy of the key, along with metadata like the file offset). What's more, we're using Java values for the keys and annotations, which is also really bad.

A better to do would be have each node (internal or leaf) have a region that stores the keys, and directly write that out instead of using going from Java object to region value to serialization.

Loading

@tpoterba
Copy link
Collaborator

@tpoterba tpoterba commented Jul 1, 2019

A better to do would be have each node (internal or leaf) have a region that stores the keys, and directly write that out instead of using going from Java object to region value to serialization.

Chris and I discussed this lasat week.

Loading

@tpoterba
Copy link
Collaborator

@tpoterba tpoterba commented Jul 1, 2019

Can we do a bit of profiling to figure out where it's spending the time?

Loading

chrisvittal added 3 commits Jul 2, 2019
This currently is exposed in `read_table` and `read_matrix_table` as
undocumented optional parameters `_intervals` and `_filter_intervals`
which takes a list of python `Interval`s that are used either as
a filter or a repartition.

This adds an `IndexedRVDSpec` as the primary container format for
indexed data, and increments the file version to 1.1.0.

One index file is written per partition. For matrix tables, the offset
to a particular key for the entries is stored in the `entries_offset`
field of the annotation that an index may contain. We use the new
`IndexSpec` to retrive the appropriate offset from the index so that we
can seek to the proper offset in the partition.

When writing data with a blocked spec (like the default) we use virtual
index offsets similar to tabix. The high 48 bits are used to indicate
the file offset to the start of a block, and the low 16 bits are used to
indicate the offset from the start of the (possibly decompressed) block.
* Only one version
* Only one AbstractRVDSpec `read` method.
* Being indexed is a property of the spec, not the version
@chrisvittal
Copy link
Collaborator Author

@chrisvittal chrisvittal commented Jul 2, 2019

It is spending it's time in IndexWriter.+=. For this particular example, we more than triple the data we're writing, before we were writing an int32, now we're writing an int32 row, along with a B-Tree with an int32 key and an int64 offset field. So honestly this is pretty good. This was always going to make writes slower.

Loading

@tpoterba
Copy link
Collaborator

@tpoterba tpoterba commented Jul 2, 2019

It is spending it's time in IndexWriter.+=

Great, this can also be fixed by not using Java objects. That makes me feel good about things.

Loading

@chrisvittal chrisvittal dismissed tpoterba’s stale review Jul 3, 2019

Think I got everything

@danking danking merged commit 277ccc7 into hail-is:master Jul 3, 2019
1 check passed
Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants