[query] IBD implemented in terms of block matrices #12629
Conversation
Force-pushed from 3c412f3 to ca1ba01
This is very nice and clean. I only see a few little things to clean up.
    'min': min,
    'max': max,
})).persist()

require_col_key_str(dataset, 'identity_by_descent')
We want this before the spark backend case, don't we?
dataset = dataset.select_cols().select_globals().select_entries('GT')
dataset = require_biallelic(dataset, 'ibd')
Duplicate lines
IS_HOM_REF = BlockMatrix.from_entry_expr(dataset.is_hom_ref).checkpoint(hl.utils.new_temp_file())
IS_HET = BlockMatrix.from_entry_expr(dataset.is_het).checkpoint(hl.utils.new_temp_file())
IS_HOM_VAR = BlockMatrix.from_entry_expr(dataset.is_hom_var).checkpoint(hl.utils.new_temp_file())
NOT_MISSING = BlockMatrix.from_entry_expr(dataset.is_not_missing).checkpoint(hl.utils.new_temp_file())
Given the expense of from_entry_expr, I think it's worth doing the minimum necessary. These four are related by not_missing = is_hom_ref + is_het + is_hom_var, so any one can be computed from the other three.
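The relation above is just the observation that the three genotype-class indicators partition the non-missing entries. A dense numpy sketch of that identity (a toy check, not the PR's Hail/BlockMatrix code):

```python
import numpy as np

# Toy genotype matrix (variants x samples):
# 0 = hom-ref, 1 = het, 2 = hom-var, -1 = missing.
GT = np.array([[0, 1, -1],
               [2, 0,  1],
               [1, 1,  2]])

is_hom_ref = (GT == 0).astype(float)
is_het = (GT == 1).astype(float)
is_hom_var = (GT == 2).astype(float)
not_missing = (GT >= 0).astype(float)

# Exactly one of the three class indicators is 1 at every called entry,
# so their sum reproduces the non-missingness indicator.
assert np.array_equal(is_hom_ref + is_het + is_hom_var, not_missing)
```

This is why one of the four expensive from_entry_expr calls can be dropped and the fourth matrix recovered by addition.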
@patrick-schultz Are you okay with where the checkpoints are? If yes, then I'll do one last set of benchmarks and then this is ready to go!
if not 0 <= min <= max <= 1:
    raise Exception(f"invalid pi hat filters {min} {max}")

sample_ids = dataset.s.collect()
Since we're making multiple passes over dataset, we probably want to checkpoint it before this first pass.
NOT_MISSING = IS_HOM_REF + IS_HET + IS_HOM_VAR

total_possible_ibs = NOT_MISSING.T @ NOT_MISSING
Not sure if checkpointing NOT_MISSING before the matmul is worth the cost of the write. Might try benchmarking both ways.
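For context on the quoted matmul: entry (i, j) of NOT_MISSING.T @ NOT_MISSING counts the variants at which samples i and j are both called, i.e. the denominator of possible IBS comparisons per pair. A dense numpy sketch (toy data, not the PR's code):

```python
import numpy as np

# Toy non-missingness indicator (variants x samples).
NOT_MISSING = np.array([[1., 1., 0.],
                        [1., 0., 1.],
                        [1., 1., 1.]])

# (samples x samples): entry (i, j) counts variants where both i and j are called.
total_possible_ibs = NOT_MISSING.T @ NOT_MISSING

assert total_possible_ibs[0, 1] == 2.0  # samples 0 and 1 co-called at variants 0 and 2
assert total_possible_ibs[0, 0] == 3.0  # sample 0 is called at every variant
```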
ibs1_pre = (IS_HET.T @ is_not_het).checkpoint(hl.utils.new_temp_file())
ibs1 = ibs1_pre + ibs1_pre.T

ibs2 = total_possible_ibs - ibs0 - ibs1
I think you want to checkpoint ibs0, ibs1, and ibs2. But again, it's not clear to me whether the IO or the repeated computation is worse.
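The subtraction in the quoted ibs2 line relies on the IBS tallies partitioning the co-called variant count: at every site where both samples are called, the pair shares exactly 0, 1, or 2 alleles. A dense numpy check of that partition (toy data, fully-called for simplicity; not the PR's block-matrix code):

```python
import numpy as np

# Toy genotypes (variants x samples): biallelic alt-allele counts 0/1/2, all called.
GT = np.array([[0, 2, 1],
               [1, 1, 0],
               [2, 2, 2],
               [0, 1, 2]])
n_variants, n_samples = GT.shape

# Direct per-pair tallies: for biallelic genotypes the number of shared
# alleles is 2 - |g_i - g_j|, which is 0, 1, or 2.
ibs0 = np.zeros((n_samples, n_samples))
ibs1 = np.zeros((n_samples, n_samples))
ibs2 = np.zeros((n_samples, n_samples))
for v in range(n_variants):
    for i in range(n_samples):
        for j in range(n_samples):
            shared = 2 - abs(int(GT[v, i]) - int(GT[v, j]))
            [ibs0, ibs1, ibs2][shared][i, j] += 1

# With everything called, ibs0 + ibs1 + ibs2 equals the variant count for
# every pair, so IBS2 can be recovered by subtraction as in the quoted line.
total = np.full((n_samples, n_samples), float(n_variants))
assert np.array_equal(ibs2, total - ibs0 - ibs1)
```

Computing ibs2 by subtraction avoids a third expensive matmul; the checkpointing question is only about whether each of these results should hit disk before being reused.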
Patrick -- can you give this a careful look over? Once we're happy with it, I'll redo the benchmarking as a last sanity check.