DOC: example of DataFrame export to HDF5 and import into R #9636

Closed
joschkazj opened this issue Mar 12, 2015 · 7 comments

@joschkazj

commented Mar 12, 2015

When searching the web, I didn't find any working examples of transferring data from pandas to R via HDF5 files, even though the pandas documentation says the HDF5 format it uses "can easily be imported into R using the rhdf5 library". The pandas export works as expected, and I inspected the resulting file with the HDF Group's viewer (HDFView).
After some experimentation, I have a working sample for exporting a DataFrame from Python/pandas and importing it into R, which could be added to the documentation to help future users:

# Example of HDF5 export for R

import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({"first": np.random.rand(100),
                   "second": np.random.rand(100),
                   "class": np.random.randint(0, 2, (100,))},
                   index=range(100))

print(df.head())

store = pd.HDFStore("transfer.hdf5", "w", complib=str("zlib"), complevel=5)
store.put("dataframe", df, data_columns=df.columns)
store.close()

Output:

   class     first    second
0      0  0.417022  0.326645
1      0  0.720324  0.527058
2      1  0.000114  0.885942
3      1  0.302333  0.357270
4      1  0.146756  0.908535

R:

# Load values and column names for all datasets from corresponding nodes and
# insert them into one data.frame object.

library(rhdf5)

loadhdf5data <- function(h5File) {
  listing <- h5ls(h5File)

  # Find all data nodes: values are stored in *_values and the corresponding
  # column titles in *_items
  data_nodes <- grep("_values", listing$name)
  name_nodes <- grep("_items", listing$name)

  data_paths <- paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
  name_paths <- paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")

  columns <- list()
  for (idx in seq_along(data_paths)) {
    # rhdf5 returns arrays with their dimensions reversed, so transpose
    entry <- data.frame(t(h5read(h5File, data_paths[idx])))
    col_names <- t(h5read(h5File, name_paths[idx]))
    colnames(entry) <- col_names
    columns <- append(columns, entry)
  }

  data <- data.frame(columns)

  return(data)
}

Now you can import the DataFrame:

> data = loadhdf5data("transfer.hdf5")
> head(data)
         first    second class
1 0.4170220047 0.3266449     0
2 0.7203244934 0.5270581     0
3 0.0001143748 0.8859421     1
4 0.3023325726 0.3572698     1
5 0.1467558908 0.9085352     1
6 0.0923385948 0.6233601     1

I hope this helps someone. :-)
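
For readers who want to see the recombination step in isolation, here is a minimal Python sketch of what the R loader above does with the *_values and *_items nodes. The `nodes` dict is a hypothetical in-memory stand-in for the HDF5 file, the function name `combine_blocks` is made up for illustration, and the (rows, columns) orientation of each block is an assumption.

```python
import numpy as np
import pandas as pd


def combine_blocks(nodes):
    """Rebuild one DataFrame from {node_name: array} pairs.

    Assumes each "<prefix>_values" node holds a (rows, columns) array and the
    matching "<prefix>_items" node holds the column names for that block.
    """
    frames = []
    for name in sorted(nodes):
        if not name.endswith("_values"):
            continue
        prefix = name[: -len("_values")]
        items = [str(item) for item in nodes[prefix + "_items"]]
        frames.append(pd.DataFrame(np.asarray(nodes[name]), columns=items))
    # Place the blocks side by side; an input without blocks gives an empty frame
    return pd.concat(frames, axis=1) if frames else pd.DataFrame()


# Hypothetical stand-in for the nodes of "transfer.hdf5" (first two rows only)
nodes = {
    "block0_items": ["first", "second"],
    "block0_values": np.array([[0.417022, 0.326645], [0.720324, 0.527058]]),
    "block1_items": ["class"],
    "block1_values": np.array([[0], [0]]),
}

result = combine_blocks(nodes)
print(result)
```

Because the blocks are appended in turn, the float columns end up before the integer column, which is consistent with the column order in the R output above.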

@shoyer

Member

commented Mar 13, 2015

This looks very helpful!

Would you like to submit a PR that adds a link to this issue in the documentation?

@joschkazj

Author

commented Mar 13, 2015

Yes, I'll do that, but it might take some time.

@jreback

Contributor

commented Mar 17, 2015

closed by #9661

@jreback jreback closed this Mar 17, 2015

@cdeterman

commented Jun 29, 2017

This is helpful, but what if the format is set to 'table'? The provided function doesn't seem to work in that case.

Python

store.put("dataframe", df, format = 'table', data_columns=df.columns)

R

> loadhdf5data(h5File)
data frame with 0 columns and 0 rows

@joschkazj

Author

commented Jun 30, 2017

For table format, you can use rhdf5 directly (non-working excerpts):

Python

with pd.HDFStore(out_name, mode="w", complib=str("zlib"),
                 complevel=5) as hdf_store:
    # Write some data
    hdf_store.append("features", job_data.loc[:, feat_columns],
                     format="table", index=False)
    hdf_store.append("labels", job_data.loc[:, label_columns],
                     format="table", data_columns=label_columns, index=False)

R:

library(rhdf5)

loadFeatures <- function(h5File) {
  # Load feature values from separate HDF5 tables into data.frame object
  #
  # Args:
  #   h5File: filename of HDF5 file to be loaded. It has to contain two tables:
  #   "/features/table" with feature values and "/labels/table" with
  #   corresponding block labels.
  #
  # Returns:
  #   A data.frame with feature values and block labels

  labels <- h5read(h5File, "/labels/table", read.attributes = FALSE)
  featTable <- h5read(h5File, "/features/table", compoundAsDataFrame = FALSE)
  feats <- data.frame(t(featTable$values_block_0))
  # data format conversion is application-specific
  feats$job <- factor(labels$job)
  feats$layer <- factor(labels$layer)
  feats$block <- labels$block

  feats$isElevated <- as.logical(labels$is_elevated)
  feats$partLabel <- labels$part_label

  return(feats)
}

feats <- loadFeatures(few_h5File)

It's been a while and I haven't used pandas and R in combination since, but this should get you started.
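
A likely reason loadhdf5data returns an empty data frame for table format is that its grep patterns find nothing to read. The Python sketch below mimics those two grep calls on node names. Both name lists are assumptions for illustration (the table-format list is based on the "/features/table" path used above), and `find_block_nodes` is a hypothetical helper.

```python
import re


def find_block_nodes(node_names):
    """Mimic grep("_values", ...) and grep("_items", ...) from the R loader."""
    values = [name for name in node_names if re.search("_values", name)]
    items = [name for name in node_names if re.search("_items", name)]
    return values, items


# Assumed node names for a store written in the default (fixed) format
fixed_names = ["axis0", "axis1", "block0_items", "block0_values",
               "block1_items", "block1_values"]
# Assumed node names for format="table": a single compound dataset
table_names = ["table"]

fixed_values, fixed_items = find_block_nodes(fixed_names)
table_values, table_items = find_block_nodes(table_names)

print("fixed:", fixed_values, fixed_items)
print("table:", table_values, table_items)
```

With no matching nodes the loop in loadhdf5data never runs, so the function falls through to data.frame(list()), which would explain the "0 columns and 0 rows" result.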

@Franck-Dernoncourt

commented Jul 14, 2017

@joschkazj Thanks for sharing your function. I am having issues with integers that need more than 32 bits.

For example, if I create the data frame:

import pandas as pd

frame = pd.DataFrame({
    'time':[1234567001,1234515616515167005],
    'X2':[23.88,23.96]
},columns=['time','X2'])

store = pd.HDFStore('a.hdf5')
store['df'] =  frame
store.close()
print(frame)

The value 1234515616515167005 becomes NA in R when I use your loadhdf5data function.

I get this warning message:

> frame = loadhdf5data("a.hdf5")
[1] 1
[1] 2
Warning message:
In H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
  NAs produced by integer overflow while converting 64-bit integer or unsigned 32-bit integer from HDF5 to a 32-bit integer in R. Choose bit64conversion='bit64' or bit64conversion='double' to avoid data loss and see the vignette 'rhdf5' for more details about 64-bit integers.

However, adding bit64conversion='bit64' or bit64conversion='double' in h5read doesn't change anything.

Do you have any idea how I can fix this issue?

@Franck-Dernoncourt

commented Jul 14, 2017

To fix this bit64 issue, see "How can I load a data frame saved in pandas as an HDF5 file in R without losing integers larger than 32 bit?". In short: run install.packages("bit64") and library(bit64), then add bit64conversion='bit64' to both h5read calls in loadhdf5data.
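
To see why the second value is the one that breaks, here is a small Python check of which R storage type could hold an integer without loss. It's only a sketch: `lossless_target` is a hypothetical helper and the labels it returns are plain strings, not rhdf5 options.

```python
INT32_MAX = 2**31 - 1  # largest value of R's native integer type


def lossless_target(value):
    """Name the narrowest storage that keeps an integer exact."""
    if not -2**63 <= value < 2**63:
        return "none"
    # R reserves -2^31 for NA, so the usable int32 range is symmetric
    if -INT32_MAX <= value <= INT32_MAX:
        return "integer"
    # A double is fine only if the round trip gives back the same integer
    if int(float(value)) == value:
        return "double"
    return "integer64"


small = 1234567001
large = 1234515616515167005

print(small, "->", lossless_target(small))
print(large, "->", lossless_target(large))
print(large, "as double ->", int(float(large)))
```

This also suggests why bit64conversion='double' is not enough for this particular value: it is too large to be represented exactly as a double, so integer64 is the only lossless choice.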
