DOC: example of DataFrame export to HDF5 and import into R #9636

Closed
joschkazj opened this issue Mar 12, 2015 · 7 comments
Labels
Docs IO HDF5 read_hdf, HDFStore

@joschkazj

Searching the web, I couldn't find a working example of transferring data from pandas to R via HDF5 files, even though the pandas documentation says the HDF5 format it uses "can easily be imported into R using the rhdf5 library". The pandas export works as expected, and I inspected the resulting file layout with the HDF Group's viewer (HDFView).
After some experimentation I have a working sample for exporting a DataFrame from Python/pandas and importing it into R, which could be added to the documentation to help future users:

# Example of HDF5 export for R

import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({"first": np.random.rand(100),
                   "second": np.random.rand(100),
                   "class": np.random.randint(0, 2, (100,))},
                   index=range(100))

print(df.head())

store = pd.HDFStore("transfer.hdf5", "w", complib=str("zlib"), complevel=5)
store.put("dataframe", df, data_columns=df.columns)
store.close()

Output:

   class     first    second
0      0  0.417022  0.326645
1      0  0.720324  0.527058
2      1  0.000114  0.885942
3      1  0.302333  0.357270
4      1  0.146756  0.908535

R:

# Load values and column names for all datasets from corresponding nodes and
# insert them into one data.frame object.

library(rhdf5)

loadhdf5data <- function(h5File) {
  listing <- h5ls(h5File)
  # Find all data nodes: values are stored in *_values and the corresponding
  # column titles in *_items
  data_nodes <- grep("_values", listing$name)
  name_nodes <- grep("_items", listing$name)

  data_paths <- paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
  name_paths <- paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")

  columns <- list()
  for (idx in seq_along(data_paths)) {
    data <- data.frame(t(h5read(h5File, data_paths[idx])))
    names <- t(h5read(h5File, name_paths[idx]))
    entry <- data.frame(data)
    colnames(entry) <- names
    columns <- append(columns, entry)
  }

  data <- data.frame(columns)
  return(data)
}

Now you can import the DataFrame:

> data = loadhdf5data("transfer.hdf5")
> head(data)
         first    second class
1 0.4170220047 0.3266449     0
2 0.7203244934 0.5270581     0
3 0.0001143748 0.8859421     1
4 0.3023325726 0.3572698     1
5 0.1467558908 0.9085352     1
6 0.0923385948 0.6233601     1

I hope this helps someone. :-)
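
For anyone who wants to see what the R function is matching against without opening HDFView, here is a minimal Python sketch that lists the nodes pandas writes for a fixed-format DataFrame. It assumes PyTables is installed (pandas needs it for HDFStore anyway); the temporary file path and the small five-row frame are only for illustration, and `_v_children` is a PyTables group attribute, not part of the pandas API.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small illustrative frame with one float block and one integer block.
rng = np.random.RandomState(1)
df = pd.DataFrame({"first": rng.rand(5),
                   "second": rng.rand(5),
                   "class": rng.randint(0, 2, 5)})

# Hypothetical scratch location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "transfer.hdf5")

with pd.HDFStore(path, mode="w", complib="zlib", complevel=5) as store:
    store.put("dataframe", df)  # default "fixed" format
    # get_node returns the underlying PyTables group for the key.
    group = store.get_node("dataframe")
    node_names = sorted(group._v_children)

# The *_values / *_items pairs are what loadhdf5data greps for.
print(node_names)
```

Each dtype block should show up as one `*_values` node (the data) plus one `*_items` node (the column names), which is why the R function pairs them up by position.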

@shoyer
Member

shoyer commented Mar 13, 2015

This looks very helpful!

Would you like to submit a PR that adds a link to this issue in the documentation?

@joschkazj
Author

Yes, I'll do that, but it might take some time.

@jreback
Contributor

jreback commented Mar 17, 2015

closed by #9661

jreback closed this as completed Mar 17, 2015

@cdeterman

This is helpful, but what if the format is set to 'table'? The provided function doesn't seem to work in that case.

Python

store.put("dataframe", df, format='table', data_columns=df.columns)

R

> loadhdf5data(h5File)
data frame with 0 columns and 0 rows
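
A likely explanation, sketched in Python: the table format does not create any `*_values` nodes, so the `grep("_values", ...)` in loadhdf5data finds nothing to read. The snippet below writes the same small frame in both formats and applies the same name filter to each file. The file names and the helper `value_nodes` are made up for this illustration, and it assumes PyTables is installed.

```python
import os
import tempfile

import pandas as pd
import tables


def value_nodes(path):
    """Return names of all nodes containing '_values' (like the R grep)."""
    with tables.open_file(path, mode="r") as h5:
        return [node._v_name for node in h5.walk_nodes("/")
                if "_values" in node._v_name]


df = pd.DataFrame({"first": [0.1, 0.2], "class": [0, 1]})

# Hypothetical scratch files, one per storage format.
tmp_dir = tempfile.mkdtemp()
fixed_path = os.path.join(tmp_dir, "fixed.hdf5")
table_path = os.path.join(tmp_dir, "table.hdf5")

df.to_hdf(fixed_path, key="dataframe", format="fixed")
df.to_hdf(table_path, key="dataframe", format="table", data_columns=True)

fixed_nodes = value_nodes(fixed_path)
table_nodes = value_nodes(table_path)

print("fixed:", fixed_nodes)
print("table:", table_nodes)
```

With the table format the data lives in a single compound dataset under `/dataframe/table`, so the name filter comes back empty and the R function builds an empty data.frame.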

@joschkazj
Author

For the table format you can use rhdf5 directly (excerpts from my code, not runnable as-is):

Python

with pd.HDFStore(out_name, mode="w", complib=str("zlib"),
                 complevel=5) as hdf_store:
    # Write some data
    hdf_store.append("features", job_data.loc[:, feat_columns],
                     format="table", index=False)
    hdf_store.append("labels", job_data.loc[:, label_columns],
                     format="table", data_columns=label_columns, index=False)

R:

library(rhdf5)

loadFeatures <- function(h5File) {
  # Load feature values from separate HDF5 tables into data.frame object
  #
  # Args:
  #   h5File: filename of HDF5 file to be loaded. It has to contain two tables:
  #   "/features/table" with feature values and "/labels/table" with
  #   corresponding block labels.
  #
  # Returns:
  #   A data.frame with feature values and block labels

  labels <- h5read(h5File, "/labels/table", read.attributes = FALSE)
  featTable <- h5read(h5File, "/features/table", compoundAsDataFrame = FALSE)
  feats <- data.frame(t(featTable$values_block_0))
  # data format conversion is application specific 
  feats$job <- factor(labels$job)
  feats$layer <- factor(labels$layer)
  feats$block <- labels$block

  feats$isElevated <- as.logical(labels$is_elevated)
  feats$partLabel <- labels$part_label

  return(feats)
}

feats <- loadFeatures(few_h5File)

It's been a while and I haven't used pandas and R in combination since, but this should get you started.
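
To see why the R side reads features from `values_block_0` but labels by name, here is a small Python sketch that writes two tables the same way and prints the field names of each. The column names (`f1`, `f2`, `job`, `block`) and the scratch path are placeholders, not the original data; `colnames` is an attribute of the PyTables table object returned by `get_node`.

```python
import os
import tempfile

import pandas as pd

# Placeholder data standing in for job_data from the excerpt above.
job_data = pd.DataFrame({"f1": [0.1, 0.2], "f2": [1.5, 2.5],
                         "job": [1, 2], "block": [10, 20]})
feat_columns = ["f1", "f2"]
label_columns = ["job", "block"]

out_name = os.path.join(tempfile.mkdtemp(), "features.hdf5")

with pd.HDFStore(out_name, mode="w", complib="zlib", complevel=5) as hdf_store:
    # No data_columns: same-dtype columns are packed into one block field.
    hdf_store.append("features", job_data.loc[:, feat_columns],
                     format="table", index=False)
    # data_columns: each listed column becomes its own named field.
    hdf_store.append("labels", job_data.loc[:, label_columns],
                     format="table", data_columns=label_columns, index=False)
    feat_fields = list(hdf_store.get_node("features/table").colnames)
    label_fields = list(hdf_store.get_node("labels/table").colnames)

print("features:", feat_fields)
print("labels:  ", label_fields)
```

Columns not listed in `data_columns` end up packed together in a block field, which is what `featTable$values_block_0` picks up in R, while data columns can be addressed by their own names such as `labels$job`.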

@Franck-Dernoncourt

@joschkazj Thanks for sharing your function. I am having issues with integers that need more than 32 bits.

For example, if I create the data frame:

import pandas as pd

frame = pd.DataFrame({
    'time':[1234567001,1234515616515167005],
    'X2':[23.88,23.96]
},columns=['time','X2'])

store = pd.HDFStore('a.hdf5')
store['df'] =  frame
store.close()
print(frame)

The value 1234515616515167005 becomes NA in R when I use your loadhdf5data function.

I get this warning message:

> frame = loadhdf5data("a.hdf5")
[1] 1
[1] 2
Warning message:
In H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
  NAs produced by integer overflow while converting 64-bit integer or unsigned 32-bit integer from HDF5 to a 32-bit integer in R. Choose bit64conversion='bit64' or bit64conversion='double' to avoid data loss and see the vignette 'rhdf5' for more details about 64-bit integers.

However, adding bit64conversion='bit64' or bit64conversion='double' in h5read doesn't change anything.

Do you have any idea how I can fix this issue?
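
For context on the two options named in the warning, here is a small plain-Python check of the two values from the example. It only illustrates the number ranges involved and does not call rhdf5: the first value fits a signed 32-bit integer, the second does not, and the second also cannot be represented exactly as a double because it is larger than 2^53.

```python
INT32_MAX = 2**31 - 1      # largest signed 32-bit integer
DOUBLE_EXACT_MAX = 2**53   # integers up to here are exact in a double

values = [1234567001, 1234515616515167005]

# Does each value fit into a signed 32-bit integer?
fits_int32 = [v <= INT32_MAX for v in values]

# Does each value survive a round trip through a 64-bit float unchanged?
survives_double = [int(float(v)) == v for v in values]

for v, ok32, okdbl in zip(values, fits_int32, survives_double):
    print(v, "fits int32:", ok32, "exact as double:", okdbl)
```

So for timestamps of this size, converting to double would avoid the NA but could still change the low digits, whereas a true 64-bit integer type keeps the value exact.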

@Franck-Dernoncourt

To fix this bit64 issue, see "How can I load a data frame saved in pandas as an HDF5 file in R without losing integers larger than 32 bit?". In short: install.packages("bit64"), then library(bit64), and add bit64conversion='bit64' twice in loadhdf5data.
