[query] IBD implemented in terms of block matrices#12629
Conversation
3c412f3 to
ca1ba01
Compare
patrick-schultz
left a comment
There was a problem hiding this comment.
This is very nice and clean. I only see a few little things to clean up.
| 'max': max, | ||
| })).persist() | ||
|
|
||
| require_col_key_str(dataset, 'identity_by_descent') |
There was a problem hiding this comment.
We want this before the spark backend case, don't we?
| dataset = dataset.select_cols().select_globals().select_entries('GT') | ||
| dataset = require_biallelic(dataset, 'ibd') |
| IS_HOM_REF = BlockMatrix.from_entry_expr(dataset.is_hom_ref).checkpoint(hl.utils.new_temp_file()) | ||
| IS_HET = BlockMatrix.from_entry_expr(dataset.is_het).checkpoint(hl.utils.new_temp_file()) | ||
| IS_HOM_VAR = BlockMatrix.from_entry_expr(dataset.is_hom_var).checkpoint(hl.utils.new_temp_file()) | ||
| NOT_MISSING = BlockMatrix.from_entry_expr(dataset.is_not_missing).checkpoint(hl.utils.new_temp_file()) |
There was a problem hiding this comment.
Given the expense of from_entry_expr, I think it's worth doing the minimum necessary. These four are related by not_missing = is_hom_ref + is_het + is_hom_var, so any one can be computed from the other three.
|
@patrick-schultz Are you okay with where the checkpoints are? If yes, then I'll do one last set of benchmarks and then this is ready to go! |
| if not 0 <= min <= max <= 1: | ||
| raise Exception(f"invalid pi hat filters {min} {max}") | ||
|
|
||
| sample_ids = dataset.s.collect() |
There was a problem hiding this comment.
Since we're making multiple passes over dataset, we probably want to checkpoint it before this first pass.
| NOT_MISSING = IS_HOM_REF + IS_HET + IS_HOM_VAR | ||
|
|
||
| total_possible_ibs = NOT_MISSING.T @ NOT_MISSING |
There was a problem hiding this comment.
Not sure if checkpointing NOT_MISSING before the matmul if worth the cost of the write. Might try benchmarking both ways.
| ibs1_pre = (IS_HET.T @ is_not_het).checkpoint(hl.utils.new_temp_file()) | ||
| ibs1 = ibs1_pre + ibs1_pre.T | ||
|
|
||
| ibs2 = total_possible_ibs - ibs0 - ibs1 |
There was a problem hiding this comment.
I think you want to checkpoint ibs0, ibs1, ibs2. But again, it's not clear to me if the io or the repeated computation is worse.
|
Patrick -- can you give this a careful look over? Once we're happy with it, I'll redo the benchmarking as a last sanity check.