DOC: example of DataFrame export to HDF5 and import into R #9636

joschkazj · 2015-03-12T15:50:32Z

When searching the web, I didn't find any working example of transferring data from pandas to R via HDF5 files, even though the pandas documentation states that the HDF5 format it uses "can easily be imported into R using the rhdf5 library". The pandas export works as expected, and I inspected the resulting file with the HDF Group's viewer (HDFView).
After some experimentation, I have a working example of exporting a DataFrame from Python/pandas and importing it into R, which could be added to the documentation to help future users:

# Example of HDF5 export for R

import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({"first": np.random.rand(100),
                   "second": np.random.rand(100),
                   "class": np.random.randint(0, 2, (100,))},
                   index=range(100))

print(df.head())

store = pd.HDFStore("transfer.hdf5", "w", complib=str("zlib"), complevel=5)
store.put("dataframe", df, data_columns=df.columns)
store.close()

Output:

   class     first    second
0      0  0.417022  0.326645
1      0  0.720324  0.527058
2      1  0.000114  0.885942
3      1  0.302333  0.357270
4      1  0.146756  0.908535

# Load values and column names for all datasets from corresponding nodes and
# insert them into one data.frame object.

library(rhdf5)

loadhdf5data <- function(h5File) {
  listing <- h5ls(h5File)
  # Find all data nodes, values are stored in *_values and corresponding column
  # titles in *_items
  data_nodes <- grep("_values", listing$name)
  name_nodes <- grep("_items", listing$name)

  data_paths <- paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
  name_paths <- paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")

  columns <- list()
  for (idx in seq_along(data_paths)) {
    # Values are transposed so that each column becomes one data.frame column
    entry <- data.frame(t(h5read(h5File, data_paths[idx])))
    colnames(entry) <- t(h5read(h5File, name_paths[idx]))
    columns <- append(columns, entry)
  }

  # Combine the columns of all blocks into a single data.frame
  data <- data.frame(columns)

  return(data)
}

Now you can import the DataFrame:

> data = loadhdf5data("transfer.hdf5")
> head(data)
         first    second class
1 0.4170220047 0.3266449     0
2 0.7203244934 0.5270581     0
3 0.0001143748 0.8859421     1
4 0.3023325726 0.3572698     1
5 0.1467558908 0.9085352     1
6 0.0923385948 0.6233601     1
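
A note on why the R function works: it relies only on node names. The sketch below mimics its selection and pairing logic in plain Python on a hypothetical listing. The listing is an assumption modelled on what h5ls() would plausibly report for the example file (one block per dtype, stored as blockN_values with its column titles in blockN_items); node_paths is an illustrative helper, not part of pandas or rhdf5.

```python
# Hypothetical (group, name) listing of a pandas "fixed"-format store,
# modelled on what rhdf5's h5ls() might report for the example above.
listing = [
    ("/", "dataframe"),
    ("/dataframe", "axis0"),
    ("/dataframe", "axis1"),
    ("/dataframe", "block0_items"),
    ("/dataframe", "block0_values"),
    ("/dataframe", "block1_items"),
    ("/dataframe", "block1_values"),
]


def node_paths(listing, marker):
    """Return the full paths of all nodes whose name contains `marker`."""
    return [
        f"{group.rstrip('/')}/{name}"
        for group, name in listing
        if marker in name
    ]


# Same selection as grep("_values", ...) and grep("_items", ...) in R.
data_paths = node_paths(listing, "_values")
name_paths = node_paths(listing, "_items")

# Values and items are matched by position, as in the R loop.
pairs = list(zip(data_paths, name_paths))
for values_path, items_path in pairs:
    print(values_path, "uses column titles from", items_path)
```

Pairing by position assumes that the listing returns the items and values nodes of the blocks in the same order, which is also what the R loop assumes.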

I hope this helps someone. :-)

shoyer · 2015-03-13T07:34:48Z

This looks very helpful!

Would you like to submit a PR that adds a link to this issue in the documentation?

joschkazj · 2015-03-13T12:10:03Z

Yes, I'll do that, but it might take some time.

jreback · 2015-03-17T10:19:10Z

closed by #9661

cdeterman · 2017-06-29T15:55:57Z

This is helpful, but what if the format is set to 'table'? The provided function doesn't seem to work in that case.

Python

store.put("dataframe", df, format = 'table', data_columns=df.columns)

R

> loadhdf5data(h5File)
data frame with 0 columns and 0 rows
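
A likely explanation for the empty result, sketched in Python: loadhdf5data selects nodes by the substrings "_values" and "_items", and a table-format store does not appear to contain such nodes. The node names below are assumptions (a simplified view of what pandas writes under one key), and guess_pandas_format is a hypothetical helper for illustration only.

```python
def guess_pandas_format(node_names):
    """Guess the pandas storage format from the node names under one key."""
    if any("_values" in name for name in node_names):
        return "fixed"
    if "table" in node_names:
        return "table"
    return "unknown"


# Assumed, simplified node names under the "dataframe" key.
fixed_nodes = ["axis0", "axis1", "block0_items", "block0_values"]
table_nodes = ["_i_table", "table"]

# The filter used by loadhdf5data finds nothing in the table layout,
# so the loop never runs and an empty data frame is returned.
matches = [n for n in table_nodes if "_values" in n or "_items" in n]
print(matches)

print(guess_pandas_format(fixed_nodes))
print(guess_pandas_format(table_nodes))
```

Under these assumptions, a table-format store has to be read through its single "table" dataset rather than through per-block nodes.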

joschkazj · 2017-06-30T09:33:43Z

For table format, you can use rhdf5 directly (non-working excerpts):

Python

with pd.HDFStore(out_name, mode="w", complib=str("zlib"),
                 complevel=5) as hdf_store:
    # Write some data
    hdf_store.append("features", job_data.loc[:, feat_columns],
                     format="table", index=False)
    hdf_store.append("labels", job_data.loc[:, label_columns],
                     format="table", data_columns=label_columns, index=False)

R:

library(rhdf5)

loadFeatures <- function(h5File) {
  # Load feature values from separate HDF5 tables into data.frame object
  #
  # Args:
  #   h5File: filename of HDF5 file to be loaded. It has to contain two tables:
  #   "/features/table" with feature values and "/labels/table" with
  #   corresponding block labels.
  #
  # Returns:
  #   A data.frame with feature values and block labels

  labels <- h5read(h5File, "/labels/table", read.attributes = FALSE)
  featTable <- h5read(h5File, "/features/table", compoundAsDataFrame = FALSE)
  feats <- data.frame(t(featTable$values_block_0))
  # data format conversion is application specific 
  feats$job <- factor(labels$job)
  feats$layer <- factor(labels$layer)
  feats$block <- labels$block

  feats$isElevated <- as.logical(labels$is_elevated)
  feats$partLabel <- labels$part_label

  return(feats)
}

feats <- loadFeatures(few_h5File)

It's been a while and I haven't used pandas and R in combination since, but this should get you started.

Franck-Dernoncourt · 2017-07-14T01:55:13Z

@joschkazj Thanks for sharing your function. I am having issues with integers larger than 32 bit.

For example, if I create the data frame:

import pandas as pd

frame = pd.DataFrame({
    'time':[1234567001,1234515616515167005],
    'X2':[23.88,23.96]
},columns=['time','X2'])

store = pd.HDFStore('a.hdf5')
store['df'] =  frame
store.close()
print(frame)

The value 1234515616515167005 becomes NA in R when I use your loadhdf5data function.

I get this warning message:

> frame = loadhdf5data("a.hdf5")
[1] 1
[1] 2
Warning message:
In H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
  NAs produced by integer overflow while converting 64-bit integer or unsigned 32-bit integer from HDF5 to a 32-bit integer in R. Choose bit64conversion='bit64' or bit64conversion='double' to avoid data loss and see the vignette 'rhdf5' for more details about 64-bit integers.

However, adding bit64conversion='bit64' or bit64conversion='double' in h5read doesn't change anything.

Do you have any idea how I can fix this issue?

Franck-Dernoncourt · 2017-07-14T03:44:07Z

To fix this bit64 issue, see "How can I load a data frame saved in pandas as an HDF5 file in R without losing integers larger than 32 bit?". In short: install.packages("bit64") + library(bit64) + added bit64conversion='bit64' twice in loadhdf5data.
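
For context on the two conversions named in the warning, here is a minimal Python check of the two values from the example. It only illustrates the numeric ranges involved and says nothing about rhdf5 internals: a value outside the signed 32-bit range cannot become a plain R integer, and a double represents integers exactly only up to 2^53, so a double conversion can silently change large values.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

# The two "time" values from the example DataFrame above.
values = [1234567001, 1234515616515167005]

results = {}
for v in values:
    fits_int32 = INT32_MIN <= v <= INT32_MAX
    # Round-trip through a 64-bit float to see whether the value survives.
    exact_as_double = int(float(v)) == v
    results[v] = (fits_int32, exact_as_double)
    print(v, "fits int32:", fits_int32, "exact as double:", exact_as_double)
```

The second value fails both checks, which is consistent with the NA in the warning and suggests that a 64-bit integer type (as provided by the bit64 package) is the safer choice for such data.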

jreback added Docs IO HDF5 read_hdf, HDFStore labels Mar 13, 2015

joschkazj mentioned this issue Mar 16, 2015

DOC: example of pandas to R transfer of DataFrame using HDF5 file #9661

Closed

jreback added this to the 0.16.0 milestone Mar 17, 2015

jreback closed this as completed Mar 17, 2015
