Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logical NAs converted to FALSE with recent rhdf5 #58

Closed
hpages opened this issue Mar 27, 2020 · 9 comments
Closed

Logical NAs converted to FALSE with recent rhdf5 #58

hpages opened this issue Mar 27, 2020 · 9 comments

Comments

@hpages
Copy link
Contributor

hpages commented Mar 27, 2020

Hi Mike,

This was originally reported here.

In release (rhdf5 2.30.1):

library(rhdf5)
m <- matrix(c(FALSE, TRUE, NA, FALSE, TRUE, NA), nrow=2)
h5write(m, "test.h5", "M1")

h5ls("test.h5", all=TRUE)
#   group name         ltype corder_valid corder cset       otype num_attrs
# 0     /   M1 H5L_TYPE_HARD        FALSE      0    0 H5I_DATASET         1
#    dclass         dtype  stype rank   dim maxdim
# 0 INTEGER H5T_STD_I32LE SIMPLE    2 2 x 3  2 x 3

h5read("test.h5", "M1")
#       [,1]  [,2] [,3]
# [1,] FALSE    NA TRUE
# [2,]  TRUE FALSE   NA
# Warning message:
# In H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
#   integer value -2^63 replaced NA. See the section 'Large integer data types' in the 'rhdf5' vignette for more details.

In devel (rhdf5 2.31.6):

library(rhdf5)

m <- matrix(c(FALSE, TRUE, NA, FALSE, TRUE, NA), nrow=2)

h5write(m, "test.h5", "M1")

h5ls("test.h5", all=TRUE)
#   group name         ltype corder_valid corder cset       otype num_attrs
# 0     /   M1 H5L_TYPE_HARD        FALSE      0    0 H5I_DATASET         1
#    dclass        dtype  stype rank   dim maxdim
# 0 INTEGER H5T_STD_U8LE SIMPLE    2 2 x 3  2 x 3

h5read("test.h5", "M1")
#       [,1]  [,2]  [,3]
# [1,] FALSE FALSE  TRUE
# [2,]  TRUE FALSE FALSE

I see that in devel you're now using H5 type H5T_STD_U8LE instead of H5T_STD_I32LE to store logical data (probably related to that discussion) and it makes a lot of sense to use an 8-bit type instead of a 32-bit type here. However the fact that H5T_STD_U8LE is an unsigned type is a problem because AFAIK the HDF5 library replaces negative values in the input (the NAs) with zeroes to make them "fit" in the unsigned type. This causes NAs to be stored as FALSEs.

I believe that if H5T_STD_I8LE was used instead of H5T_STD_U8LE then the HDF5 library would replace NAs with the most negative value in the H5T_STD_I8LE range (-128 ?) so I guess that would work. h5read() would still need to make sure to map -128 values back to INT_MIN (the value R uses to represent integer and logical NAs).

Thanks,
H.

@hpages
Copy link
Contributor Author

hpages commented Apr 30, 2020

Mike, any thought on this? Thanks!

@grimbough
Copy link
Owner

Sorry, been a bit swamped with other things coming up to release time. I'll bump this up the to-do list

@grimbough
Copy link
Owner

This is actually a little trickier than I expect, as the 8bit H5T_STD_I8LE gets read into a RAWSXP and even -128 gets converted to 0 at that point. I'll think about a strategy for handling this more tomorrow.

@grimbough
Copy link
Owner

This should be fixed in version 2.33.3:

library(rhdf5)

m <- matrix(c(FALSE, TRUE, NA, FALSE, TRUE, NA), nrow=2)
file <- tempfile(fileext = '.h5')

h5write(m, file, "M1")
h5ls(file, all=TRUE)
#>   group name         ltype corder_valid corder cset       otype num_attrs
#> 0     /   M1 H5L_TYPE_HARD        FALSE      0    0 H5I_DATASET         1
#>    dclass        dtype  stype rank   dim maxdim
#> 0 INTEGER H5T_STD_I8LE SIMPLE    2 2 x 3  2 x 3
h5read(file, "M1")
#>       [,1]  [,2] [,3]
#> [1,] FALSE    NA TRUE
#> [2,]  TRUE FALSE   NA

It no longer prints the warning about converting a number to NA. For logicals you can only end up in the -128 → NA mapping if there's also an attribute with storage.mode == logical so I feel comfortable this isn't going to catch out many users that didn't create the original file in R.

This is a bug, so I'll give it a few days to make sure nothing funky happens in other packages, and then also patch the release branch.

@grimbough grimbough moved this from High priority issues to Done in Bioconductor rhdf5 package suite Jun 9, 2020
@hpages
Copy link
Contributor Author

hpages commented Jun 16, 2020

Yep, works as expected now.

Not sure why you have to read into a RAWSXP first. Wouldn't it be possible to alloc a LGLSXP on detection of the storage.mode == logical attribute and read directly into that?

PeteHaitch added a commit to PeteHaitch/DelayedMatrixStats that referenced this issue Jun 28, 2020
- Re-enabled after fix to HDF5Array in Bioconductor/HDF5Array#28 and rhdf5 in grimbough/rhdf5#58
PeteHaitch added a commit to PeteHaitch/DelayedMatrixStats that referenced this issue Jun 28, 2020
- Re-enabled after fix to HDF5Array in Bioconductor/HDF5Array#28 and rhdf5 in grimbough/rhdf5#58
@hpages
Copy link
Contributor Author

hpages commented Jul 3, 2020

@grimbough I was wondering if you have any plans to port the "NA handling" stuff to the RELEASE_3_11 branch. In particular, the RELEASE_3_11 branch still uses H5T_STD_U8LE to store logical data so the issue of NAs getting stored as FALSEs remains there. Thanks!

@grimbough
Copy link
Owner

Hi @hpages I think I've pushed the required changes to RELEASE_3_11 (rhdf5 version 2.32.2).

@hpages
Copy link
Contributor Author

hpages commented Jul 3, 2020

Great. Thanks Mike!

@hpages
Copy link
Contributor Author

hpages commented Jul 10, 2020

Yep, I can confirm that rhdf5 2.32.2 addresses the original issue. Thanks!

@hpages hpages closed this as completed Jul 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants