Skip to content

[query] IBD implemented in terms of block matrices#12629

Merged
danking merged 10 commits into
hail-is:mainfrom
jigold:ibd-block-matrix-python
Feb 8, 2023
Merged

[query] IBD implemented in terms of block matrices#12629
danking merged 10 commits into
hail-is:mainfrom
jigold:ibd-block-matrix-python

Conversation

@jigold

@jigold jigold commented Jan 27, 2023

Copy link
Copy Markdown
Contributor

Patrick -- can you give this a careful look over? Once we're happy with it, I'll redo the benchmarking as a last sanity check.

@jigold jigold force-pushed the ibd-block-matrix-python branch from 3c412f3 to ca1ba01 Compare January 27, 2023 18:14

@patrick-schultz patrick-schultz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very nice and clean. I only see a few little things to clean up.

'max': max,
})).persist()

require_col_key_str(dataset, 'identity_by_descent')

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want this before the spark backend case, don't we?

Comment on lines 125 to 126
dataset = dataset.select_cols().select_globals().select_entries('GT')
dataset = require_biallelic(dataset, 'ibd')

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate lines

Comment on lines +173 to +176
IS_HOM_REF = BlockMatrix.from_entry_expr(dataset.is_hom_ref).checkpoint(hl.utils.new_temp_file())
IS_HET = BlockMatrix.from_entry_expr(dataset.is_het).checkpoint(hl.utils.new_temp_file())
IS_HOM_VAR = BlockMatrix.from_entry_expr(dataset.is_hom_var).checkpoint(hl.utils.new_temp_file())
NOT_MISSING = BlockMatrix.from_entry_expr(dataset.is_not_missing).checkpoint(hl.utils.new_temp_file())

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the expense of from_entry_expr, I think it's worth doing the minimum necessary. These four are related by not_missing = is_hom_ref + is_het + is_hom_var, so any one can be computed from the other three.

@jigold

jigold commented Jan 30, 2023

Copy link
Copy Markdown
Contributor Author

@patrick-schultz Are you okay with where the checkpoints are? If yes, then I'll do one last set of benchmarks and then this is ready to go!

if not 0 <= min <= max <= 1:
raise Exception(f"invalid pi hat filters {min} {max}")

sample_ids = dataset.s.collect()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're making multiple passes over dataset, we probably want to checkpoint it before this first pass.

Comment on lines +173 to +175
NOT_MISSING = IS_HOM_REF + IS_HET + IS_HOM_VAR

total_possible_ibs = NOT_MISSING.T @ NOT_MISSING

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if checkpointing NOT_MISSING before the matmul if worth the cost of the write. Might try benchmarking both ways.

ibs1_pre = (IS_HET.T @ is_not_het).checkpoint(hl.utils.new_temp_file())
ibs1 = ibs1_pre + ibs1_pre.T

ibs2 = total_possible_ibs - ibs0 - ibs1

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want to checkpoint ibs0, ibs1, ibs2. But again, it's not clear to me if the io or the repeated computation is worse.

@jigold

jigold commented Jan 31, 2023

Copy link
Copy Markdown
Contributor Author
# (n_samples, n_variants)

# Spark backend (1 core)
# (2500, 100000) =>   147.05s
# (5000, 10000) =>    113.28s

# Local backend
# (2500, 100000) =>   233.94s
# (5000, 10000) =>    489.46s

@danking danking merged commit eb0f48b into hail-is:main Feb 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants