
import_bgen scan massively slower in 0.2 than 0.1 #3862

Closed
cseed opened this issue Jun 28, 2018 · 6 comments

cseed commented Jun 28, 2018

I tracked down why this is happening.

The old code stored the (compressed) genotype data per variant in a buffer and decoded it in BgenRecord.getValue.

The new code decodes eagerly, though only if the entries are needed. I assume the intention was to mark the entries as unneeded during the scan (but not when decoding the actual values), but this wasn't done. It also isn't easy to do, since we can't set a per-import Hadoop configuration; see #3861.

Options:

  • go back to the old code that stashes the compressed value and evaluates lazily,
  • have separate InputFormat/RecordReader implementations for scan and decode,
  • stop using Hadoop InputFormat to load BGEN and code it directly in Spark, where it is trivial to pass different parameters to scan and decode (see the sketch at the end of this comment).

I personally vote for the third option.
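
For concreteness, here's a minimal sketch of the third option. BgenBlock, BgenRow, and inflate below are illustrative stand-ins, not Hail's actual types; the point is just that with a plain Spark RDD, the scan/decode distinction is a closure-captured boolean rather than a per-job Hadoop Configuration setting:

```scala
import java.util.zip.Inflater
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Illustrative stand-ins for Hail internals; not the real types.
case class BgenBlock(key: String, compressedEntries: Array[Byte], uncompressedLen: Int)
case class BgenRow(key: String, entries: Option[Array[Byte]])

// zlib-inflate a genotype block of known uncompressed length.
def inflate(bytes: Array[Byte], len: Int): Array[Byte] = {
  val inf = new Inflater()
  inf.setInput(bytes)
  val out = new Array[Byte](len)
  inf.inflate(out)
  inf.end()
  out
}

// dropEntries is captured by the closure, so scan (true) and decode (false)
// can use different settings without touching any shared configuration.
def loadBgen(sc: SparkContext, blocks: Seq[BgenBlock], dropEntries: Boolean): RDD[BgenRow] =
  sc.parallelize(blocks).map { b =>
    if (dropEntries) BgenRow(b.key, None)
    else BgenRow(b.key, Some(inflate(b.compressedEntries, b.uncompressedLen)))
  }
```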

tpoterba commented Jul 2, 2018

We can get the performance back in the short term with a two-line change moving decompression from advance() to getValue() -- the data field isn't used anywhere in advance().
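
A sketch of what that change looks like (the names follow this thread; the bodies are illustrative, not Hail's actual code):

```scala
import java.util.zip.Inflater

// Sketch of the fix: advance() stashes the compressed block instead of
// inflating it, and getValue() inflates on demand, so a scan that never
// reads entries never pays for decompression.
class BgenRecordSketch(uncompressedLen: Int) {
  private var data: Array[Byte] = _ // compressed genotype block

  // Runs once per record during a scan; deliberately cheap.
  def advance(block: Array[Byte]): Unit =
    data = block

  // Called only when entry data is actually needed.
  def getValue: Array[Byte] = {
    val inf = new Inflater()
    inf.setInput(data)
    val out = new Array[Byte](uncompressedLen)
    inf.inflate(out)
    inf.end()
    out
  }
}
```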

tpoterba commented Jul 2, 2018

I should have caught this in review, though: https://github.com/hail-is/hail/pull/3783/files

danking commented Jul 2, 2018

I think Tim's suggestion and Cotton's option 1 are basically the same: stash the (possibly compressed) bytes in data, then decompress in getValue only if necessary. This gets us back to the previous performance, but we still pay to copy the data even if we never read it.

If this is impacting people, we should do that because it seems low-risk and high-value.

As I think we all do, I prefer option 3 as the long-term solution. I found spreading the code across two methods a little confusing; ideally there would be just one method that decodes and writes into the RVB (a rough sketch below). I can pick up a proper rewrite this week or next.
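
A rough sketch of that single-method shape, reusing the hypothetical BgenBlock and inflate stand-ins from the earlier sketch (RowBuilder here is an illustrative stand-in, not Hail's actual RegionValueBuilder API):

```scala
// Illustrative builder interface; not Hail's RegionValueBuilder.
trait RowBuilder {
  def addKey(key: String): Unit
  def addEntries(entries: Array[Byte]): Unit
}

// One method decodes a block and writes into the builder; whether to decode
// entries is an explicit parameter, so scan and decode share a single path.
def decodeAndWrite(block: BgenBlock, rvb: RowBuilder, dropEntries: Boolean): Unit = {
  rvb.addKey(block.key)
  if (!dropEntries)
    rvb.addEntries(inflate(block.compressedEntries, block.uncompressedLen))
}
```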

tpoterba commented Jul 2, 2018

Yes, this is Cotton's option 1! I was just curious what had changed between 0.1 and 0.2, because I thought we had the lazy behavior until recently.

danking commented Jul 2, 2018

Yeah, the laziness worked until just recently; I broke it when I refactored things for Caitlin.

tpoterba commented

this is resolved.
