
import_bgen scan massively slower in 0.2 than 0.1 #3862

Closed
cseed opened this issue Jun 28, 2018 · 6 comments

cseed commented Jun 28, 2018

I tracked down why this is happening.

The old code stored the (compressed) genotype data per variant in a buffer and decoded it in BgenRecord.getValue.

The new code decodes eagerly, though only if the entries are needed. I assume the intention was to mark the entries as unneeded during the scan (but not when decoding the actual values), but this wasn't done. It also isn't easy to do, since we can't set a per-import Hadoop configuration; see #3861.

Options:

  • go back to the old code that stashes the compressed value and evaluates lazily,
  • have separate InputFormat/RecordReader implementations for scan and decode,
  • stop using Hadoop InputFormat to load BGEN and code it directly in Spark, where it is trivial to pass different parameters to scan and decode (see the sketch at the end of this comment).

I personally vote for the third option.
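
For concreteness, here's a minimal sketch of the third option. BgenBlock, BgenRow, and inflate below are illustrative stand-ins, not Hail's actual types; the point is just that with a plain Spark RDD, the scan/decode distinction is a closure-captured boolean rather than a per-job Hadoop Configuration setting:

```scala
import java.util.zip.Inflater
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Illustrative stand-ins for Hail internals; not the real types.
case class BgenBlock(key: String, compressedEntries: Array[Byte], uncompressedLen: Int)
case class BgenRow(key: String, entries: Option[Array[Byte]])

// zlib-inflate a genotype block of known uncompressed length.
def inflate(bytes: Array[Byte], len: Int): Array[Byte] = {
  val inf = new Inflater()
  inf.setInput(bytes)
  val out = new Array[Byte](len)
  inf.inflate(out)
  inf.end()
  out
}

// dropEntries is captured by the closure, so scan (true) and decode (false)
// can use different settings without touching any shared configuration.
def loadBgen(sc: SparkContext, blocks: Seq[BgenBlock], dropEntries: Boolean): RDD[BgenRow] =
  sc.parallelize(blocks).map { b =>
    if (dropEntries) BgenRow(b.key, None)
    else BgenRow(b.key, Some(inflate(b.compressedEntries, b.uncompressedLen)))
  }
```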

tpoterba commented Jul 2, 2018

We can get the performance back in the short term with a two-line change moving decompression from advance() to getValue() -- the data field isn't used anywhere in advance().
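
A sketch of what that change looks like (the names follow this thread; the bodies are illustrative, not Hail's actual code):

```scala
import java.util.zip.Inflater

// Sketch of the fix: advance() stashes the compressed block instead of
// inflating it, and getValue() inflates on demand, so a scan that never
// reads entries never pays for decompression.
class BgenRecordSketch(uncompressedLen: Int) {
  private var data: Array[Byte] = _ // compressed genotype block

  // Runs once per record during a scan; deliberately cheap.
  def advance(block: Array[Byte]): Unit =
    data = block

  // Called only when entry data is actually needed.
  def getValue: Array[Byte] = {
    val inf = new Inflater()
    inf.setInput(data)
    val out = new Array[Byte](uncompressedLen)
    inf.inflate(out)
    inf.end()
    out
  }
}
```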

tpoterba commented Jul 2, 2018

I should have caught this in review, though: https://github.com/hail-is/hail/pull/3783/files

danking commented Jul 2, 2018

I think Tim's suggestion and Cotton's option 1 are basically the same: stash the (possibly compressed) bytes in data, then decompress in getValue only if necessary. This gets us back to the previous performance, but we still pay to copy the data even if we never read it.

If this is impacting people, we should do that because it seems low-risk and high-value.

As I think we all do, I prefer option 3 as the long-term solution. I found spreading the code across two methods a little confusing; ideally there would be just one method that decodes and writes into the RVB (a rough sketch below). I can pick up a proper rewrite this week or next.
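
A rough sketch of that single-method shape, reusing the hypothetical BgenBlock and inflate stand-ins from the earlier sketch (RowBuilder here is an illustrative stand-in, not Hail's actual RegionValueBuilder API):

```scala
// Illustrative builder interface; not Hail's RegionValueBuilder.
trait RowBuilder {
  def addKey(key: String): Unit
  def addEntries(entries: Array[Byte]): Unit
}

// One method decodes a block and writes into the builder; whether to decode
// entries is an explicit parameter, so scan and decode share a single path.
def decodeAndWrite(block: BgenBlock, rvb: RowBuilder, dropEntries: Boolean): Unit = {
  rvb.addKey(block.key)
  if (!dropEntries)
    rvb.addEntries(inflate(block.compressedEntries, block.uncompressedLen))
}
```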

tpoterba commented Jul 2, 2018

Yes, this is Cotton's option 1! I was just curious what had changed between 0.1 and 0.2, because I thought we had the lazy behavior until recently.

danking commented Jul 2, 2018

Yeah, the laziness worked until just recently; I broke it when I refactored things for Caitlin.

tpoterba commented

this is resolved.
