import_bgen scan massively slower in 0.2 than 0.1 #3862
Comments
We can get the performance back in the short term with a 2-line change moving decompression from advance() to getValue().
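A minimal sketch of that lazy-decompression fix, assuming invented names (LazyBgenRecord, decompressCalls, decompressedSize) rather than Hail's actual API: advance() only stashes the compressed genotype bytes, and getValue() inflates them on demand, so a scan that never calls getValue() pays no decompression cost.

```java
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical sketch, not Hail's real BgenRecord: advance() is cheap and
// only records the compressed bytes; the expensive inflate happens in
// getValue(), and only when getValue() is actually called.
public class LazyBgenRecord {
    private byte[] compressed;        // stashed, still-compressed genotype data
    private int decompressedSize;
    public int decompressCalls = 0;   // instrumentation for this sketch only

    // advance(): just record the current variant's bytes.
    public void advance(byte[] compressedInput, int decompressedSize) {
        this.compressed = compressedInput;
        this.decompressedSize = decompressedSize;
    }

    // getValue(): decompress lazily, on first demand.
    public byte[] getValue() {
        decompressCalls++;
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[decompressedSize];
        try {
            inflater.inflate(out);
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        } finally {
            inflater.end();
        }
        return out;
    }
}
```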
I should have caught this in review, though: https://github.com/hail-is/hail/pull/3783/files
I think Tim's suggestion and Cotton's #1 are basically the same? Stash the (possibly) uncompressed bytes. If this is impacting people, we should do that, because it seems low-risk and high-value.

As I think we all do, I prefer #3 as the long-term solution. I found spreading the code across two methods a little confusing; ideally there would be just one method that decodes and writes into the RVB. I can pick up a proper rewrite this or next week.
Yes, this is Cotton's #1! I was just curious what had changed between 0.1 and 0.2, because I thought we had the lazy behavior until recently.
Yeah, the laziness worked until just recently; I broke it when I refactored things for Caitlin.
This is resolved.
I tracked down why this is happening.
The old code stored the (compressed) genotype data per variant in a buffer and decoded it in BgenRecord.getValue.
The new code decodes eagerly, but only if the entries are needed. I assume the intention was to mark the entries as unneeded during the scan (but not when decoding the actual values), but this wasn't done. It isn't easy to do, either, since we can't set a per-import Hadoop configuration; see #3861.
Options:
I personally vote for the latter.
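The cost difference described above can be illustrated with a toy sketch (all names invented, not Hail's actual classes): an eager reader pays the decompression cost in advance() for every scanned row, even when the entries are never read, while a lazy reader defers that work to getValue(), so a metadata-only scan (like counting variants) does no decompression at all.

```java
// Hypothetical eager-vs-lazy comparison; decompression is simulated with a
// counter rather than real inflate calls.
public class ScanCost {
    interface Record {
        void advance();
        int[] getValue();
    }

    // Eager: "decompression" happens in advance(), so every scanned row pays.
    static class EagerRecord implements Record {
        int decompressions = 0;
        private int[] decoded;
        public void advance() { decompressions++; decoded = new int[]{0, 1, 2}; }
        public int[] getValue() { return decoded; }
    }

    // Lazy: advance() only stashes bytes; the cost is deferred to getValue().
    static class LazyRecord implements Record {
        int decompressions = 0;
        public void advance() { /* stash compressed bytes only */ }
        public int[] getValue() { decompressions++; return new int[]{0, 1, 2}; }
    }

    // A metadata-only scan, like a row count, never calls getValue().
    static int countRows(Record r, int nRows) {
        int count = 0;
        for (int i = 0; i < nRows; i++) {
            r.advance();
            count++;
        }
        return count;
    }
}
```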