
make ld_prune fast again #5078

Merged: danking merged 33 commits from danking:fix-4506 into hail-is:master on Jan 24, 2019

Conversation

@danking (Collaborator) commented Jan 4, 2019

This gets ld_prune on the get_1kg data down to around 37s. That's still ~1000 times slower than PLINK.

mt = hl.read_matrix_table('repartitioned.mt')
pruned_tbl = hl.ld_prune(mt.GT, r2=0.2, bp_window_size=1000000, memory_per_core=1000)
pruned_tbl.write("pruned_tbl.ht", overwrite=True)

Performance Wins:

  • local LD prune returns an unkeyed, unsorted dataset, and ld_prune collects the relatively small number of surviving variants locally instead of doing table joins (I'm doing the broadcast-join optimization manually; see the sketch after this list)
  • avoid a key_by (and thus a sort) of the output of maximal_independent_set; again, we do a broadcast join
  • removed two unnecessary writes (at the cost of losing some debugging output)
  • maximal_independent_set no longer keys its result, avoiding another sort
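
A minimal sketch of the manual broadcast-join idea from the first bullet (not the PR's actual code; the pipeline is simplified and variants are matched by locus only):

import hail as hl

mt = hl.read_matrix_table('repartitioned.mt')
pruned = hl.ld_prune(mt.GT, r2=0.2, bp_window_size=1000000)

# the pruned table is small, so collecting its loci to the driver is cheap
keep_loci = hl.literal(set(pruned.locus.collect()))

# a broadcast membership filter replaces a sorted table join
result = mt.filter_rows(keep_loci.contains(mt.locus))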

Minor Changes:

  • I don't set env vars anymore, so I needed an easy way to pip install Hail; I added a Gradle task for that plus an associated file that does almost the same thing as deploy.sh. You should complain and make me consolidate these two files.

Big Data Test

I'm running a test on profile225 right now.


Resolves #4506.

danking changed the title from "Fix 4506" to "make ld_prune fast again" on Jan 4, 2019

@tpoterba (Collaborator) left a comment

Looks great. Something to address:

else:
    variants_to_remove = hl.maximal_independent_set(entries.i, entries.j, keep=False)

variants_to_remove = variants_to_remove.collect()

@tpoterba (Collaborator), Jan 7, 2019:

Can this be variants_to_remove.node.idx.collect()? That would transmit slightly less data.

@danking (Author, Collaborator), Jan 8, 2019:

Ah yeah, I basically want to annotate the globals of the locally pruned table with hl.agg.collect_as_set(variants_to_remove.node.idx). If I do

locally_pruned_table.annotate_globals(
    variants_to_remove=hl.agg.collect_as_set(variants_to_remove.node.idx))

does that work? They're different tables, but I'd prefer not to go through Python if I can avoid it.

@tpoterba (Collaborator), Jan 8, 2019:

That won't work, but you can use Konrad's Christmas present!

locally_pruned_table.annotate_globals(
    variants_to_remove=variants_to_remove.aggregate(
        hl.agg.collect_as_set(variants_to_remove.node.idx), _localize=False))
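
For context, a self-contained sketch of this pattern (the range tables are stand-ins, not the PR's data): with _localize=False, Table.aggregate returns a lazy Hail expression rather than a Python value, so an aggregate over one table can annotate another without a round trip through Python.

import hail as hl

t1 = hl.utils.range_table(10)  # stands in for the maximal_independent_set result
t2 = hl.utils.range_table(5)   # stands in for the locally pruned table

# the aggregate stays a Hail expression; no values are pulled into Python
idx_set = t1.aggregate(hl.agg.collect_as_set(t1.idx), _localize=False)
t2 = t2.annotate_globals(variants_to_remove=idx_set)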

@danking (Author, Collaborator), Jan 22, 2019:

Used, thank you!

@@ -1213,7 +1213,7 @@ class BlockMatrix(val blocks: RDD[((Int, Int), BDM[Double])],
}
}

-      new Table(hc, entriesRDD, rvRowType, Array("i", "j"))
+      new Table(hc, entriesRDD, rvRowType, Array[String]())

@tpoterba (Collaborator), Jan 7, 2019:

This is technically a user-facing change, right? That's OK since this is experimental, but I'll remember to put it in the change log.

@danking (Author, Collaborator), Jan 22, 2019:

Fixed by adding an optional keyed=True parameter in Python, which calls key_by when necessary.
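
A hypothetical sketch of that shim (the backend call name is invented; only the keyed parameter and the key_by behavior come from the PR discussion):

def entries(self, keyed=True):
    # fast path: the backend now returns an unkeyed, unsorted table
    t = self._unkeyed_entries_table()  # hypothetical backend call
    if keyed:
        # restore the old user-facing behavior, at the cost of a sort
        t = t.key_by('i', 'j')
    return t

Callers that don't need the sorted output can pass keyed=False and skip the shuffle.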

@danking (Author, Collaborator) commented Jan 8, 2019

Update: this took 160s on profile225 (2.0GB as .vcf.gz). The input to LD prune (after filtering and split-multi) is 700MB as an mt; 1KG is 16MB as an mt. There's clearly a lot of overhead for small datasets.

@danking (Author, Collaborator) commented Jan 8, 2019

That's a 30x increase in data size with a 3x increase in time. I'm gonna try it on a 37GB dataset now.

@danking (Author, Collaborator) commented Jan 8, 2019

@tpoterba see changes; I made it backwards compatible by adding a parameter. I'm using Konrad's xmas present now.

Let's not approve until I test against local UKBB and test in a cluster (I want to ensure it works at scale too).

@danking (Author, Collaborator) commented Jan 9, 2019

I let LD prune run for a while on a 30GB BGEN file. There are known issues here, but I made some changes to avoid useless work, including fusing the variant filtering with the LD prune; it's a single-stage Spark job.

It took 1.6 hours to do a bit more than a third of the file. That's about 5700 seconds for ~10 GB, compared to 160s for 0.7 GB: roughly a 36x increase in time for a 14x increase in data. I hope to rerun this after the BGEN fixes to see what the scaling looks like.

Anyway, as part of the BlockMatrix IR changes that Daniel will work on, we'll look into why LD prune isn't within 8x of PLINK.

@danking (Author, Collaborator) commented Jan 17, 2019

Fixed a bunch of things and improved it a bit: ~30s for the small test now when combined with the changes from #5107.

danking force-pushed the danking:fix-4506 branch 2 times, most recently from d96300e to 61fd47b on Jan 17, 2019

Addressed.

@danking (Author, Collaborator) commented Jan 22, 2019

I made three more improvements last week (and I don't plan to do anything else now).

@danking (Author, Collaborator) commented Jan 22, 2019

Still takes ~30s on the 1kg example.

danking force-pushed the danking:fix-4506 branch from d715a0d to cbcd4a8 on Jan 23, 2019

@tpoterba (Collaborator) left a comment

One comment:

if entry_expr in mt._fields_inverse:
    # FIXME: remove once select_entries on a field is free
    field = mt._fields_inverse[entry_expr]
    mt._write_block_matrix(path, overwrite, field, block_size)

@tpoterba (Collaborator), Jan 24, 2019:

This prevents field pruning. Emitting an mt.select_entries('field') would fix that.
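
A sketch of the suggested fix (variable names come from the snippet above; this is not necessarily the PR's final code):

if entry_expr in mt._fields_inverse:
    field = mt._fields_inverse[entry_expr]
    # selecting down to the one needed entry field lets the optimizer
    # prune the other entry fields before the write
    mt = mt.select_entries(field)
    mt._write_block_matrix(path, overwrite, field, block_size)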

@danking (Author, Collaborator), Jan 24, 2019:

Addressed. I also removed the FIXME in favor of a punch-list issue: #5202.

danking added some commits on Jan 24, 2019

danking merged commit a87e087 into hail-is:master on Jan 24, 2019

1 check passed: hail-ci-0-1 (successful build)