
[query] IBD implemented in terms of block matrices #12629

Merged (10 commits) on Feb 8, 2023

Conversation

@jigold (Contributor) commented Jan 27, 2023

Patrick -- can you give this a careful look over? Once we're happy with it, I'll redo the benchmarking as a last sanity check.

@jigold jigold force-pushed the ibd-block-matrix-python branch from 3c412f3 to ca1ba01 Compare January 27, 2023 18:14
@patrick-schultz (Collaborator) left a comment

This is very nice and clean. I only see a few little things to clean up.

'min': min,
'max': max,
})).persist()

require_col_key_str(dataset, 'identity_by_descent')

We want this before the spark backend case, don't we?

Comment on lines 125 to 126
dataset = dataset.select_cols().select_globals().select_entries('GT')
dataset = require_biallelic(dataset, 'ibd')

Duplicate lines

Comment on lines 173 to 176
IS_HOM_REF = BlockMatrix.from_entry_expr(dataset.is_hom_ref).checkpoint(hl.utils.new_temp_file())
IS_HET = BlockMatrix.from_entry_expr(dataset.is_het).checkpoint(hl.utils.new_temp_file())
IS_HOM_VAR = BlockMatrix.from_entry_expr(dataset.is_hom_var).checkpoint(hl.utils.new_temp_file())
NOT_MISSING = BlockMatrix.from_entry_expr(dataset.is_not_missing).checkpoint(hl.utils.new_temp_file())

Given the expense of from_entry_expr, I think it's worth doing the minimum necessary. These four are related by not_missing = is_hom_ref + is_het + is_hom_var, so any one can be computed from the other three.
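The identity the reviewer relies on can be sanity-checked with a minimal NumPy sketch in place of Hail's `BlockMatrix`: for every entry, exactly one of the three genotype indicators is 1 when the call is present and all three are 0 when it is missing, so the indicators sum to the not-missing mask. (The `-1` missing encoding below is a toy convention for this sketch, not Hail's representation.)

```python
import numpy as np

# Toy genotype calls: 0 = hom-ref, 1 = het, 2 = hom-var, -1 = missing
# (hypothetical encoding, for illustration only).
gt = np.array([[0, 1, -1],
               [2, 2, 0],
               [1, -1, 1]])

is_hom_ref = (gt == 0).astype(np.int64)
is_het = (gt == 1).astype(np.int64)
is_hom_var = (gt == 2).astype(np.int64)
not_missing = (gt >= 0).astype(np.int64)

# The four matrices are linearly dependent, so any one of them can be
# recovered from the other three without another from_entry_expr pass.
assert np.array_equal(not_missing, is_hom_ref + is_het + is_hom_var)
assert np.array_equal(is_het, not_missing - is_hom_ref - is_hom_var)
```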

@jigold (Contributor, Author) commented Jan 30, 2023

@patrick-schultz Are you okay with where the checkpoints are? If yes, then I'll do one last set of benchmarks and then this is ready to go!

if not 0 <= min <= max <= 1:
raise Exception(f"invalid pi hat filters {min} {max}")

sample_ids = dataset.s.collect()

Since we're making multiple passes over dataset, we probably want to checkpoint it before this first pass.

Comment on lines 173 to 175
NOT_MISSING = IS_HOM_REF + IS_HET + IS_HOM_VAR

total_possible_ibs = NOT_MISSING.T @ NOT_MISSING

Not sure if checkpointing NOT_MISSING before the matmul is worth the cost of the write. Might try benchmarking both ways.
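What `NOT_MISSING.T @ NOT_MISSING` computes can be illustrated with a small NumPy stand-in for `BlockMatrix` (toy data): with a variants-by-samples 0/1 mask, entry `[i, j]` of the product counts the variants at which samples `i` and `j` are both called, i.e. the sites that can contribute to IBS for that pair.

```python
import numpy as np

# 4 variants x 3 samples; 1 = genotype called, 0 = missing (toy data).
not_missing = np.array([[1, 1, 0],
                        [1, 0, 1],
                        [1, 1, 1],
                        [0, 1, 1]])

# (n_samples, n_samples) matrix of pairwise co-called variant counts.
total_possible_ibs = not_missing.T @ not_missing

# Diagonal entry [i, i] is simply sample i's total call count.
assert total_possible_ibs[0, 0] == 3
# Samples 0 and 1 are both called at variants 0 and 2.
assert total_possible_ibs[0, 1] == 2
```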

ibs1_pre = (IS_HET.T @ is_not_het).checkpoint(hl.utils.new_temp_file())
ibs1 = ibs1_pre + ibs1_pre.T

ibs2 = total_possible_ibs - ibs0 - ibs1

I think you want to checkpoint ibs0, ibs1, ibs2. But again, it's not clear to me whether the I/O or the repeated computation is worse.
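Checkpoint placement aside, the IBS recurrences under discussion can be sketched end to end with NumPy in place of `BlockMatrix` (toy data; the `-1` missing encoding is an assumption of this sketch, not Hail's): IBS0 pairs one hom-ref with one hom-var, IBS1 pairs a het with a homozygote, and IBS2 is whatever remains of the co-called sites.

```python
import numpy as np

# 4 variants x 3 samples; 0 = hom-ref, 1 = het, 2 = hom-var, -1 = missing.
gt = np.array([[0, 1, 2],
               [2, 2, 0],
               [1, 1, 1],
               [0, 2, -1]])

IS_HOM_REF = (gt == 0).astype(np.int64)
IS_HET = (gt == 1).astype(np.int64)
IS_HOM_VAR = (gt == 2).astype(np.int64)
NOT_MISSING = IS_HOM_REF + IS_HET + IS_HOM_VAR
is_not_het = IS_HOM_REF + IS_HOM_VAR  # called and homozygous

total_possible_ibs = NOT_MISSING.T @ NOT_MISSING

# IBS0: one sample hom-ref, the other hom-var.
ibs0_pre = IS_HOM_REF.T @ IS_HOM_VAR
ibs0 = ibs0_pre + ibs0_pre.T

# IBS1: one sample het, the other homozygous.
ibs1_pre = IS_HET.T @ is_not_het
ibs1 = ibs1_pre + ibs1_pre.T

# IBS2: the remaining co-called sites (both het, or the same homozygote).
ibs2 = total_possible_ibs - ibs0 - ibs1

# Every co-called site falls into exactly one IBS class, so computing ibs2
# by subtraction agrees with computing it directly.
ibs2_direct = (IS_HOM_REF.T @ IS_HOM_REF
               + IS_HET.T @ IS_HET
               + IS_HOM_VAR.T @ IS_HOM_VAR)
assert np.array_equal(ibs2, ibs2_direct)
```

Expanding `(R + H + V).T @ (R + H + V)` into its nine cross terms and subtracting the IBS0 and IBS1 terms leaves exactly `R.T@R + H.T@H + V.T@V`, which is why the subtraction form is safe.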

@jigold (Contributor, Author) commented Jan 31, 2023

# (n_samples, n_variants)

# Spark backend (1 core)
# (2500, 100000) =>   147.05s
# (5000, 10000) =>    113.28s

# Local backend
# (2500, 100000) =>   233.94s
# (5000, 10000) =>    489.46s

@danking danking merged commit eb0f48b into hail-is:main Feb 8, 2023