
More options for control of behavior in read_genotype #33

Closed · eric-czech opened this issue Aug 24, 2020 · 16 comments

@eric-czech

Given that, per the paper, the intention with bgen in practice is often to use fewer than 20 bits to store the discretized probabilities, do you think it would be reasonable to use 32-bit floats instead of 64-bit for the numpy arrays those probabilities are read into, @horta?

I'm looking at this and wondering what it would take to make that dtype a parameter, or whether it shouldn't just be float32 all the time:

probs = full((nsamples, ncombs), nan, dtype=float64)
lib.bgen_genotype_read(genotype, ffi.cast("double *", probs.ctypes.data))

I'm also wondering if making it possible to skip reading some of these other fields would speed things up appreciably:

phased = lib.bgen_genotype_phased(genotype)
ploidy = full(nsamples, 0, dtype=uint8)
lib.read_ploidy(genotype, ffi.cast("uint8_t *", ploidy.ctypes.data), nsamples)
missing = full(nsamples, 0, dtype=bool)
lib.read_missing(genotype, ffi.cast("bool *", missing.ctypes.data), nsamples)
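
For concreteness, here's roughly what I have in mind (just a sketch against the same lib/ffi handles as above; the dtype and with_phase_info parameters are hypothetical, not part of the current API):

from numpy import bool_, float64, full, nan, uint8

def read_genotype(genotype, nsamples, ncombs, dtype=float64, with_phase_info=True):
    # Read into the float64 buffer the C API expects, then hand back a copy
    # in the caller's dtype; skip the extra per-sample reads on request.
    buf = full((nsamples, ncombs), nan, dtype=float64)
    lib.bgen_genotype_read(genotype, ffi.cast("double *", buf.ctypes.data))
    probs = buf if dtype == float64 else buf.astype(dtype)

    if not with_phase_info:
        return probs, None, None, None

    phased = lib.bgen_genotype_phased(genotype)
    ploidy = full(nsamples, 0, dtype=uint8)
    lib.read_ploidy(genotype, ffi.cast("uint8_t *", ploidy.ctypes.data), nsamples)
    missing = full(nsamples, 0, dtype=bool_)
    lib.read_missing(genotype, ffi.cast("bool *", missing.ctypes.data), nsamples)
    return probs, phased, ploidy, missing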

@CarlKCarlK (Collaborator)

CarlKCarlK commented Aug 24, 2020 via email

@eric-czech (Author)

Thanks @CarlKCarlK! Let's see:

I haven’t looked at the new CBGEN API

Looks to be the same: https://github.com/limix/cbgen/blob/master/cbgen/_bgen_file.py#L70

Bgen2 allocates a (relatively tiny) float64 buffer that it reuses across samples (until/unless the number of combinations changes, which often never happens). It then copies from that buffer into an array of whatever dtype and order the user wants.

👍

“If you know the compression level of your BGEN file, you can sometimes save 50% or 75% on memory with dtype. (Test with your data to confirm you are not losing any precision.) The approximate relationship is:

* BGEN compression 1 to 10 bits: dtype='float16'
* BGEN compression 11 to 23 bits: dtype='float32'
* BGEN compression 24 to 32 bits: dtype='float64' (default)”

Whoa, where'd that come from?! I've been wondering that very thing recently.

Aside: The Bed reader’s C++ code offers direct support for {float32,float64,int8}x{F,C}

How does the int8 option work? If there were a way to get the encoded values (i.e. before they're converted back to probabilities), that would be amazing. I'm basically working through how to redo the encoding in sgkit-dev/sgkit-bgen#14, so bypassing that altogether would be great.
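
(For context, the re-encoding I'd like to bypass is roughly the quantization below; a loose sketch, not the spec's exact rounding rule, which additionally constrains each distribution's integers to sum to 2**bits - 1:)

import numpy as np

def reencode(probs, bits=16):
    # Map each decoded probability back onto the bits-wide integer grid.
    # NOTE: the real BGEN encoder uses a smarter rounding rule so that
    # the integers for each distribution sum exactly to 2**bits - 1.
    return np.round(probs * (2**bits - 1)).astype(np.uint32)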

It was a pain to “templatize” the C++ code with macros.

Very cool. I figured that would be a pain. How do you do it, out of curiosity?

@CarlKCarlK (Collaborator)

CarlKCarlK commented Aug 24, 2020 via email

@eric-czech (Author)

Just to clarify: it is the PLINK Bed reader, not the Bgen reader, that offers direct support for {float32,float64,int8}x{F,C}. (The names are confusing. I’ll start prefacing “bed” with “plink bed”.)

Whoops, read that too quickly. I see.

@CarlKCarlK (Collaborator)

If you know the compression level of your BGEN file, you can sometimes save 50% or 75% on memory with dtype. (Test with your data to confirm you are not losing any precision.) The approximate relationship is:

* BGEN compression 1 to 10 bits: dtype='float16'
* BGEN compression 11 to 23 bits: dtype='float32'
* BGEN compression 24 to 32 bits: dtype='float64' (default)

Whoa, where'd that come from?! I've been wondering that very thing recently.

I think I did some crazy thing with my BGEN writer (which uses QCTOOL under the covers) and Excel. However, you can shortcut that (at least to ±1) with this table on Wikipedia, https://en.wikipedia.org/wiki/Floating-point_arithmetic#Internal_representation, and its "bit precision" column.
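
For the curious, the table is easy to sanity-check with numpy (a minimal sketch, not from the original thread):

import numpy as np

# Significand precision in bits (implicit leading bit included), from the
# Wikipedia table: half = 11, single = 24, double = 53.
PRECISION = [(np.float16, 11), (np.float32, 24), (np.float64, 53)]

def smallest_safe_dtype(bgen_bits):
    # Keep ~1 bit of headroom, matching the table above.
    for dtype, prec in PRECISION:
        if bgen_bits <= prec - 1:
            return dtype
    return np.float64

# Check an 8-bit BGEN file, whose stored probabilities are k / 255:
bits = 8
dtype = smallest_safe_dtype(bits)          # -> float16
exact = np.arange(2**bits) / (2**bits - 1)
err = np.abs(exact.astype(dtype).astype(np.float64) - exact)
assert err.max() < 0.5 / (2**bits - 1)     # every stored level stays distinct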

@CarlKCarlK (Collaborator)

(This is an aside that refers to the PLINK Bed Reader)

It was a pain to “templatize” the C++ code with macros.

Very cool. I figured that would be a pain. How do you do it, out of curiosity?

https://github.com/fastlmm/bed-reader/blob/master/bed_reader/CPlinkBedFile.cpp
does nothing but set some macros and then include
https://github.com/fastlmm/bed-reader/blob/master/bed_reader/CPlinkBedFileT.cpp
six times (once for each combination).

e.g.

// Stamp out the double, C-order variant:
#define REAL double
#define ORDERC
#undef ORDERF
#undef MISSING_VALUE
#define SUFFIX(NAME) NAME ## doubleCAAA
#include "CPlinkBedFileT.cpp"
#undef REAL
#undef SUFFIX

// Stamp out the float, C-order variant:
#define REAL float
#define ORDERC
#undef ORDERF
#undef MISSING_VALUE
#define SUFFIX(NAME) NAME ## floatCAAA
#include "CPlinkBedFileT.cpp"
#undef REAL
#undef SUFFIX

#define REAL double
[...and so on...]

@horta (Collaborator)

horta commented Aug 24, 2020

As Carl mentioned, it is possible; his API already does that.
I'm going to reference this issue on cbgen, as I would need to have that option there first.

@horta (Collaborator)

horta commented Aug 24, 2020

As Carl hinted, the bgen file format uses a variable number of bits to encode each probability: from 1 bit to 32 bits. The BGEN format specifies how to convert that sequence of bits into a real number. It does not specify the floating-point format, but I take it to be double precision.

And I take the suggestion given by their specification (https://www.well.ox.ac.uk/~gav/bgen_format/spec/latest.html) to renormalize the resulting numbers so they sum to one.

I don't like that part of bgen because it is not precise enough: someone could create another bgen reader that gives slightly different probability numbers.

@eric-czech (Author)

I think I did some crazy thing with my BGEN writer (which uses QCTOOL under the covers) and Excel. However, you can shortcut that (at least to ±1) with this table on Wikipedia, https://en.wikipedia.org/wiki/Floating-point_arithmetic#Internal_representation, and its "bit precision" column.

Thanks @CarlKCarlK. Very handy reference.

I don't like that part of bgen because it is not precise enough: someone could create another bgen reader that gives slightly different probability numbers.

I'm a little confused on that one @horta -- don't all readers have to compute one of the genotype probabilities as 1 minus the sum of the others, since one of them is always left out of the file? Or do you mean that another reader might choose to decode the ints in the file as float32 instead of, say, float64 before doing the subtraction?

@eric-czech (Author)

Oh and @horta, do you think skipping reads of phasing, ploidy, and missing fields is still worth adding (the probability precision discussion aside)?

@horta (Collaborator)

horta commented Sep 2, 2020

I don't like that part of bgen because it is not precise enough: someone could create another bgen reader that gives slightly different probability numbers.

I'm a little confused on that one @horta -- don't all readers have to compute one of the genotype probabilities as 1 minus the sum of the others, since one of them is always left out of the file? Or do you mean that another reader might choose to decode the ints in the file as float32 instead of, say, float64 before doing the subtraction?

The bgen specification does not say which floating-point format to use (I guess they had in mind binary64 from the IEEE 754 standard, which is the common meaning of double in C), nor the order of the arithmetic operations on those floating-point values. And at the end it says:

In practice there may be some rounding error in probabilities input into the BGEN format. We therefore renormalise input probabilities to sum to one.

Should we normalize or not? Even if so, the order of the arithmetic operations still matters to guarantee determinism.
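
To make the concern concrete, here is a toy decode (my own sketch, assuming the plain k / (2**B - 1) decoding) in which float32 and float64 readers imply slightly different third probabilities:

import numpy as np

# Two of three genotype probabilities stored at B = 16 bits; the third
# is reconstructed as 1 minus their sum.
B = 16
scale = 2**B - 1
stored = np.array([21845, 21845], dtype=np.uint32)  # both encode ~1/3

p64 = stored / np.float64(scale)
last64 = 1.0 - p64.sum()

p32 = stored.astype(np.float32) / np.float32(scale)
last32 = np.float32(1.0) - p32.sum()

# The two readers disagree in the trailing bits of the implied probability.
print(last64, float(last32), last64 == float(last32))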

@horta (Collaborator)

horta commented Sep 2, 2020

Oh and @horta, do you think skipping reads of phasing, ploidy, and missing fields is still worth adding (the probability precision discussion aside)?

Yes, I will work on that tonight. It should be quite easy to implement. And then I can finally start using cbgen in bgen-reader.

@CarlKCarlK (Collaborator)

CarlKCarlK commented Sep 2, 2020 via email

@horta (Collaborator)

horta commented Sep 3, 2020

cbgen now accepts a probability precision (either 32 or 64 bits) and can read the probabilities alone if desired: https://cbgen.readthedocs.io/en/stable/bgen_file.html#cbgen.bgen_file.read_probability
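
A quick usage sketch (the path and offset below are placeholders, variant offsets normally come from the metafile, and the exact call signatures should be checked against the linked docs):

import cbgen

bgen = cbgen.bgen_file("example.bgen")
offset = 0  # placeholder variant offset

# Full read at 32-bit precision: probabilities plus phased/ploidy/missing.
geno = bgen.read_genotype(offset, precision=32)

# Probability-only read, skipping the other per-sample fields.
probs = bgen.read_probability(offset, precision=32)

bgen.close()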

@eric-czech (Author)

eric-czech commented Sep 3, 2020

Eric, my understanding is that, surprisingly, they don’t leave that one probability out of the file!

Hm, is this perhaps true for phased probabilities but not unphased ones? I'm looking at this part of the spec:

For unphased data ... Probabilities are stored in colex order of these vectors. The last probability (corresponding to the K-th allele homozygotes) is not stored

So at the end, there is a normalization step to make the distribution “more right”.

Gotcha, I can see how that makes sense after the decoding now. Though I'm still thinking of this as being implicit in the "1 minus the other probabilities" step for unphased calls, but as a "divide by the sum" operation for phased calls. Let me know if that's wrong.

cbgen now accepts probability precision (either 32 or 64 bits) and is able to read probability only if desired

😁 thanks!

@CarlKCarlK (Collaborator)

CarlKCarlK commented Sep 3, 2020 via email
