
More options for control of behavior in read_genotype #33

Closed · eric-czech opened this issue Aug 24, 2020 · 16 comments

@eric-czech

Given that, per the paper, the intention with bgen in practice is often to use fewer than 20 bits to store the discretized probabilities, do you think it would be reasonable to use 32-bit floats instead of 64-bit for the numpy arrays those probabilities are read into, @horta?

I'm looking at this and wondering what it would take to make that dtype a parameter, or whether it shouldn't just be float32 all the time:

probs = full((nsamples, ncombs), nan, dtype=float64)
lib.bgen_genotype_read(genotype, ffi.cast("double *", probs.ctypes.data))

I'm also wondering if making it possible to skip reading some of these other fields would speed things up appreciably:

phased = lib.bgen_genotype_phased(genotype)
ploidy = full(nsamples, 0, dtype=uint8)
lib.read_ploidy(genotype, ffi.cast("uint8_t *", ploidy.ctypes.data), nsamples)
missing = full(nsamples, 0, dtype=bool)
lib.read_missing(genotype, ffi.cast("bool *", missing.ctypes.data), nsamples)
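
For concreteness, here's roughly what I have in mind (just a sketch against the same lib/ffi handles as above; the dtype and with_phase_info parameters are hypothetical, not part of the current API):

from numpy import bool_, float64, full, nan, uint8

def read_genotype(genotype, nsamples, ncombs, dtype=float64, with_phase_info=True):
    # Read into the float64 buffer the C API expects, then hand back a copy
    # in the caller's dtype; skip the extra per-sample reads on request.
    buf = full((nsamples, ncombs), nan, dtype=float64)
    lib.bgen_genotype_read(genotype, ffi.cast("double *", buf.ctypes.data))
    probs = buf if dtype == float64 else buf.astype(dtype)

    if not with_phase_info:
        return probs, None, None, None

    phased = lib.bgen_genotype_phased(genotype)
    ploidy = full(nsamples, 0, dtype=uint8)
    lib.read_ploidy(genotype, ffi.cast("uint8_t *", ploidy.ctypes.data), nsamples)
    missing = full(nsamples, 0, dtype=bool_)
    lib.read_missing(genotype, ffi.cast("bool *", missing.ctypes.data), nsamples)
    return probs, phased, ploidy, missing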

@CarlKCarlK (Collaborator)

CarlKCarlK commented Aug 24, 2020 via email

@eric-czech (Author)

Thanks @CarlKCarlK! Let's see:

I haven’t looked at the new CBGEN API

Looks to be the same: https://github.com/limix/cbgen/blob/master/cbgen/_bgen_file.py#L70

Bgen2 allocates a (relatively tiny) float64 buffer that it reuses across samples (until/unless the number of combinations changes, which often never happens). It then copies from that buffer into an array of whatever dtype and order the user wants.

👍

“If you know the compression level of your BGEN file, you can sometimes save 50% or 75% on memory with dtype. (Test with your data to confirm you are not losing any precision.) The approximate relationship is:

* BGEN compression 1 to 10 bits: dtype='float16'
* BGEN compression 11 to 23 bits: dtype='float32'
* BGEN compression 24 to 32 bits: dtype='float64' (default)”

Whoa, where'd that come from?! I've been wondering that very thing recently.

Aside: The Bed reader’s C++ code offers direct support for {float32,float64,int8}x{F,C}

How does the int8 option work? If there were a way to get the encoded values (i.e. before they're converted back to probabilities), that would be amazing. I'm basically working through how to redo the encoding in sgkit-dev/sgkit-bgen#14, so bypassing that altogether would be great.
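
(For context, the re-encoding I'd like to bypass is roughly the quantization below; a loose sketch, not the spec's exact rounding rule, which additionally constrains each distribution's integers to sum to 2**bits - 1:)

import numpy as np

def reencode(probs, bits=16):
    # Map each decoded probability back onto the bits-wide integer grid.
    # NOTE: the real BGEN encoder uses a smarter rounding rule so that
    # the integers for each distribution sum exactly to 2**bits - 1.
    return np.round(probs * (2**bits - 1)).astype(np.uint32)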

It was a pain to “templatize” the C++ code with macros.

Very cool. I figured that would be a pain. How do you do it, out of curiosity?

@CarlKCarlK (Collaborator)

CarlKCarlK commented Aug 24, 2020 via email

@eric-czech (Author)

Just to clarify: it is the PLINK Bed reader, not the Bgen reader, that offers direct support for {float32,float64,int8}x{F,C}. (The names are confusing. I’ll start prefacing “bed” with “plink bed”.)

Whoops, read that too quickly. I see.

@CarlKCarlK (Collaborator)

If you know the compression level of your BGEN file, you can sometimes save 50% or 75% on memory with dtype. (Test with your data to confirm you are not losing any precision.) The approximate relationship is:

* BGEN compression 1 to 10 bits: dtype='float16'
* BGEN compression 11 to 23 bits: dtype='float32'
* BGEN compression 24 to 32 bits: dtype='float64' (default)

Whoa, where'd that come from?! I've been wondering that very thing recently.

I think I did some crazy thing with my BGEN writer (which uses QCTOOL under the covers) and Excel. However, you can shortcut that (at least to ±1) with this table on Wikipedia, https://en.wikipedia.org/wiki/Floating-point_arithmetic#Internal_representation, and its "bit precision" column.
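
For the curious, the table is easy to sanity-check with numpy (a minimal sketch, not from the original thread):

import numpy as np

# Significand precision in bits (implicit leading bit included), from the
# Wikipedia table: half = 11, single = 24, double = 53.
PRECISION = [(np.float16, 11), (np.float32, 24), (np.float64, 53)]

def smallest_safe_dtype(bgen_bits):
    # Keep ~1 bit of headroom, matching the table above.
    for dtype, prec in PRECISION:
        if bgen_bits <= prec - 1:
            return dtype
    return np.float64

# Check an 8-bit BGEN file, whose stored probabilities are k / 255:
bits = 8
dtype = smallest_safe_dtype(bits)          # -> float16
exact = np.arange(2**bits) / (2**bits - 1)
err = np.abs(exact.astype(dtype).astype(np.float64) - exact)
assert err.max() < 0.5 / (2**bits - 1)     # every stored level stays distinct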

@CarlKCarlK (Collaborator)

(This is an aside that refers to the PLINK Bed Reader)

It was a pain to “templatize” the C++ code with macros.

Very cool. I figured that would be a pain. How do you do it, out of curiosity?

https://github.com/fastlmm/bed-reader/blob/master/bed_reader/CPlinkBedFile.cpp
does nothing but set some macros and then include
https://github.com/fastlmm/bed-reader/blob/master/bed_reader/CPlinkBedFileT.cpp
six times (once for each combination).

e.g.

// Stamp out the double, C-order variant:
#define REAL double
#define ORDERC
#undef ORDERF
#undef MISSING_VALUE
#define SUFFIX(NAME) NAME ## doubleCAAA
#include "CPlinkBedFileT.cpp"
#undef REAL
#undef SUFFIX

// Stamp out the float, C-order variant:
#define REAL float
#define ORDERC
#undef ORDERF
#undef MISSING_VALUE
#define SUFFIX(NAME) NAME ## floatCAAA
#include "CPlinkBedFileT.cpp"
#undef REAL
#undef SUFFIX

#define REAL double
[...and so on...]

@horta (Collaborator)

horta commented Aug 24, 2020

As Carl mentioned, it is possible; his API already does that.
I'm going to reference this issue on cbgen, as I would need to have that option there first.

@horta (Collaborator)

horta commented Aug 24, 2020

As Carl hinted, the bgen file format uses a variable number of bits to encode each probability: from 1 bit to 32 bits. The BGEN format specifies how to convert that sequence of bits into a real number. It does not specify the floating-point format, but I take it to be double precision.

And I take the suggestion given by their specification (https://www.well.ox.ac.uk/~gav/bgen_format/spec/latest.html) to renormalize the resulting numbers so they sum to one.

I don't like that part of bgen because it is not precise enough: someone could create another bgen reader that gives slightly different probability numbers.

@eric-czech (Author)

I think I did some crazy thing with my BGEN writer (which uses QCTOOL under the covers) and Excel. However, you can shortcut that (at least to ±1) with this table on Wikipedia, https://en.wikipedia.org/wiki/Floating-point_arithmetic#Internal_representation, and its "bit precision" column.

Thanks @CarlKCarlK. Very handy reference.

I don't like that part of bgen because it is not precise enough: someone could create another bgen reader that gives slightly different probability numbers.

I'm a little confused on that one @horta -- don't all readers have to compute one of the genotype probabilities as 1 minus the sum of the others, since one of them is always left out of the file? Or do you mean that another reader might choose to decode the ints in the file as float32 instead of, say, float64 before doing the subtraction?

@eric-czech (Author)

Oh and @horta, do you think skipping reads of phasing, ploidy, and missing fields is still worth adding (the probability precision discussion aside)?

@horta (Collaborator)

horta commented Sep 2, 2020

I don't like that part of bgen because it is not precise enough: someone could create another bgen reader that gives slightly different probability numbers.

I'm a little confused on that one @horta -- don't all readers have to compute one of the genotype probabilities as 1 minus the sum of the others, since one of them is always left out of the file? Or do you mean that another reader might choose to decode the ints in the file as float32 instead of, say, float64 before doing the subtraction?

The bgen specification does not say which floating-point format to use (I guess they had in mind binary64 from the IEEE 754 standard, which is the common meaning of double in C), nor the order of the arithmetic operations on those floating-point values. And at the end it says:

In practice there may be some rounding error in probabilities input into the BGEN format. We therefore renormalise input probabilities to sum to one.

Should we normalize or not? Even if so, the order of the arithmetic operations still matters to guarantee determinism.
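
To make the concern concrete, here is a toy decode (my own sketch, assuming the plain k / (2**B - 1) decoding) in which float32 and float64 readers imply slightly different third probabilities:

import numpy as np

# Two of three genotype probabilities stored at B = 16 bits; the third
# is reconstructed as 1 minus their sum.
B = 16
scale = 2**B - 1
stored = np.array([21845, 21845], dtype=np.uint32)  # both encode ~1/3

p64 = stored / np.float64(scale)
last64 = 1.0 - p64.sum()

p32 = stored.astype(np.float32) / np.float32(scale)
last32 = np.float32(1.0) - p32.sum()

# The two readers disagree in the trailing bits of the implied probability.
print(last64, float(last32), last64 == float(last32))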

@horta (Collaborator)

horta commented Sep 2, 2020

Oh and @horta, do you think skipping reads of phasing, ploidy, and missing fields is still worth adding (the probability precision discussion aside)?

Yes, I will work on that tonight. It should be quite easy to implement. And then I can finally start using cbgen in bgen-reader.

@CarlKCarlK (Collaborator)

CarlKCarlK commented Sep 2, 2020 via email

@horta (Collaborator)

horta commented Sep 3, 2020

cbgen now accepts a probability precision (either 32 or 64 bits) and can read the probabilities alone if desired: https://cbgen.readthedocs.io/en/stable/bgen_file.html#cbgen.bgen_file.read_probability
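
A quick usage sketch (the path and offset below are placeholders, variant offsets normally come from the metafile, and the exact call signatures should be checked against the linked docs):

import cbgen

bgen = cbgen.bgen_file("example.bgen")
offset = 0  # placeholder variant offset

# Full read at 32-bit precision: probabilities plus phased/ploidy/missing.
geno = bgen.read_genotype(offset, precision=32)

# Probability-only read, skipping the other per-sample fields.
probs = bgen.read_probability(offset, precision=32)

bgen.close()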

@eric-czech (Author)

eric-czech commented Sep 3, 2020

Eric, my understanding is that, surprisingly, they don’t leave that one probability out of the file!

Hm, is this perhaps true for phased probabilities but not unphased ones? I'm looking at this part of the spec:

For unphased data ... Probabilities are stored in colex order of these vectors. The last probability (corresponding to the K-th allele homozygotes) is not stored

So at the end, there is a normalization step to make the distribution “more right”.

Gotcha, I can see how that makes sense after the decoding now. Though I'm still thinking of this as being implicit in the "1 minus the other probabilities" step for unphased calls, but as a "divide by the sum" operation for phased calls. Let me know if that's wrong.

cbgen now accepts probability precision (either 32 or 64 bits) and is able to read probability only if desired

😁 thanks!

@CarlKCarlK (Collaborator)

CarlKCarlK commented Sep 3, 2020 via email
