# HDF5 file tutorial in R

This notebook demonstrates basic usage of HDF5 files in R. HDF5 files can store scientific data, optionally compressed, that is too large to hold in memory. Simple scripts let users retrieve parts of the data and load them into memory for calculation and analysis, without unpacking everything.

First we install the required package, rhdf5, from Bioconductor.

In [1]:
source("https://bioconductor.org/biocLite.R")
biocLite("rhdf5")

Bioconductor version 3.4 (BiocInstaller 1.24.0), ?biocLite for help
A new version of Bioconductor is available after installing the most recent
  version of R; see http://bioconductor.org/install
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.4 (BiocInstaller 1.24.0), R 3.3.1 (2016-06-21).
Installing package(s) ‘rhdf5’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Old packages: 'assertthat', 'BH', 'car', 'caret', 'codetools', 'colorspace',
  'crayon', 'curl', 'DBI', 'devtools', 'digest', 'dplyr', 'evaluate',
  'forecast', 'formatR', 'ggplot2', 'git2r', 'htmltools', 'httpuv', 'httr',
  'IRdisplay', 'jsonlite', 'knitr', 'lattice', 'lme4', 'markdown', 'MASS',
  'Matrix', 'memoise', 'mgcv', 'mime', 'nlme', 'nycflights13', 'openssl',
  'pbdZMQ', 'pbkrtest', 'quantreg', 'R6', 'Rcpp', 'RcppArmadillo', 'RcppEigen',
  'repr', 'reshape2', 'rmarkdown', 'RSQLite', 'rstudioapi', 'scales', 'shiny',
  'SparseM', 'stringi', 'stringr', 'tibble', 'tidyr', 'tse
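
The installation cell above relies on biocLite, which matches the Bioconductor 3.4 setup shown in the output. On newer R versions (3.5 and later), Bioconductor packages are installed through the BiocManager package instead. The following is a minimal sketch of the equivalent step, assuming internet access and a writable package library:

```r
# Assumption: R >= 3.5, where BiocManager replaces biocLite().
# Install rhdf5 only if it is not already available.
if (!requireNamespace("rhdf5", quietly = TRUE)) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("rhdf5")
}
```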

In [2]:
library("rhdf5")

After installing the Bioconductor package and loading the library, we can create a new H5 file.

In [3]:
h5createFile("gene_expression.h5")

Groups, which behave like folders, can be created to organize data entries. For this example we will use two groups, one for metadata and one for the numerical data.

In [4]:
h5createGroup("gene_expression.h5","meta")
h5createGroup("gene_expression.h5","data")

The data we want to save consists of integers. The storage.mode argument of h5createDataset selects the type used on disk; other types, such as "double" or "character", are also possible. Because gene counts are whole numbers, we use the integer storage.mode, which stores them compactly.
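
To illustrate why the storage mode matters, here is a small sketch in base R that compares the in-memory size of the same values stored as integer and as double. The variable names and the seed are made up for this illustration, and the size of the data inside an HDF5 file will additionally depend on chunking and compression.

```r
# Base R only: compare how integer and double vectors are stored in memory.
set.seed(1)  # hypothetical seed, only to make the sketch reproducible
counts_int <- sample(1:10000, 100000, replace = TRUE)
counts_dbl <- as.numeric(counts_int)

storage.mode(counts_int)  # "integer"
storage.mode(counts_dbl)  # "double"

# Integers take 4 bytes per value, doubles take 8 bytes per value,
# so the double vector needs roughly twice the memory.
size_int <- as.numeric(object.size(counts_int))
size_dbl <- as.numeric(object.size(counts_dbl))
size_ratio <- size_dbl / size_int
size_ratio
```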

As an example we will just create a random matrix and pretend it's our gene expression.

In [5]:
# Simulate 1000 genes x 100 samples of random counts between 1 and 10000
gene_expression_matrix = matrix(sample(1:10000, 1000*100, replace=TRUE), 1000, 100)

In [6]:
gene_expression_matrix[1:4,1:4]

0,1,2,3
4442,1299,5387,2277
9931,3034,4474,1307
6711,1614,3058,9725
1699,1211,6378,5005


Now that we have our simulated gene expression matrix, we can save it to disk in the newly created H5 file. First we create an empty dataset and then fill it with the expression values. The chunk size defines the blocks of data that are stored and retrieved as a unit; with chunks of 200 x 100, the 1000 x 100 matrix is split into five blocks of 200 rows each.

In [7]:
h5createDataset("gene_expression.h5", "data/expression", c(1000, 100), chunk=c(200,100), storage.mode = "integer")
H5close()
h5write(gene_expression_matrix, "gene_expression.h5", "data/expression", index=list(1:nrow(gene_expression_matrix), 1:ncol(gene_expression_matrix)))
H5close()
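
The effect of the chunk layout on later reads can be worked out with a little arithmetic. The sketch below uses base R only and the dimensions from the cell above; `chunks_for_rows` is a hypothetical helper written for this illustration, not part of rhdf5. It assumes that a chunked dataset is read one whole chunk at a time, so a request for a few rows still touches every chunk those rows fall into.

```r
# Layout from the tutorial: a 1000 x 100 dataset stored in 200 x 100 chunks.
dataset_dim <- c(1000, 100)
chunk_dim   <- c(200, 100)

# Number of chunks along each dimension, and in total.
chunks_per_dim <- ceiling(dataset_dim / chunk_dim)
n_chunks <- prod(chunks_per_dim)

# Hypothetical helper: which row chunks does a set of row indices fall into?
chunks_for_rows <- function(rows, chunk_rows = chunk_dim[1]) {
  unique((rows - 1) %/% chunk_rows + 1)
}

first_half <- chunks_for_rows(1:500)      # rows 1..500 span chunks 1, 2 and 3
two_rows   <- chunks_for_rows(c(1, 999))  # two distant rows span chunks 1 and 5
```

Under that assumption, reading rows 1 to 500 involves three chunks (600 rows on disk), which is why a chunk shape that matches the typical access pattern tends to be the better choice.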

We can also save the gene symbols and column names. We save these vectors in the meta group. As an example, we save some fake gene names. We don't have to create the dataset beforehand if we write the whole object at once.

In [8]:
gene_names = as.character(1:1000)
h5write(gene_names, "gene_expression.h5", "meta/genes")
H5close()
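
Column (sample) names can be stored in the same way. The following is a minimal sketch that writes them to a temporary file so that the tutorial file is not modified; the sample names and the dataset path `meta/samples` are made up for this illustration.

```r
library(rhdf5)

# Assumption: a throwaway file, so the tutorial file stays untouched.
h5_path <- tempfile(fileext = ".h5")
h5createFile(h5_path)
h5createGroup(h5_path, "meta")

# Hypothetical sample names; the tutorial itself does not define any.
sample_names <- paste0("sample_", 1:100)
h5write(sample_names, h5_path, "meta/samples")

# Read the names back to confirm the round trip.
stored_samples <- h5read(h5_path, "meta/samples")
H5close()
```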

We have now created an H5 file and saved the metadata and a gene count matrix. We can list its contents with the following command.

In [9]:
h5ls("gene_expression.h5")

group,name,otype,dclass,dim
/,data,H5I_GROUP,,
/data,expression,H5I_DATASET,INTEGER,1000 x 100
/,meta,H5I_GROUP,,
/meta,genes,H5I_DATASET,STRING,1000
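
Since h5ls returns a data frame with the columns shown above, the listing can be filtered like any other data frame. Here is a small sketch that keeps only the datasets and builds their full paths; it uses a temporary file with made-up contents rather than the tutorial file.

```r
library(rhdf5)

# Assumption: a small throwaway file with one group and one dataset.
h5_path <- tempfile(fileext = ".h5")
h5createFile(h5_path)
h5createGroup(h5_path, "meta")
h5write(as.character(1:10), h5_path, "meta/genes")

# h5ls() returns a data frame; keep only rows that describe datasets.
contents <- h5ls(h5_path)
datasets <- contents[contents$otype == "H5I_DATASET", ]

# Build full paths; sub() avoids a double slash for datasets in the root group.
dataset_paths <- sub("^//", "/", paste(datasets$group, datasets$name, sep = "/"))
H5close()
```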


# Reading data from H5

Now that we have written the data to disk, we can retrieve parts of it. Here we load all gene names from the meta group and, for the first 500 genes, the expression values in columns 1 to 10 plus columns 20 and 30.

In [10]:
retrieved_genes = h5read("gene_expression.h5", "meta/genes")
sub_matrix = h5read("gene_expression.h5", "data/expression", index=list(1:500, c(1:10, 20, 30)))

In [11]:
dim(sub_matrix)

In [12]:
sub_matrix[1:3,1:3]

0,1,2
4442,1299,5387
9931,3034,4474
6711,1614,3058
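
The gene names stored in the meta group can also be used to select rows by name rather than by position. The sketch below shows one possible way to do this on a small temporary file; the gene names and matrix values are made up for the illustration. It relies on the rhdf5 convention that a NULL entry in the index list selects every element of that dimension.

```r
library(rhdf5)

# Assumption: a small throwaway file that mirrors the tutorial layout.
h5_path <- tempfile(fileext = ".h5")
h5createFile(h5_path)
h5createGroup(h5_path, "meta")
h5createGroup(h5_path, "data")

expr  <- matrix(1:(20 * 5), nrow = 20, ncol = 5)  # 20 genes x 5 samples
genes <- paste0("gene_", 1:20)                    # hypothetical gene names
h5write(expr, h5_path, "data/expression")
h5write(genes, h5_path, "meta/genes")

# Look up the row positions of the genes of interest.
all_genes <- as.vector(h5read(h5_path, "meta/genes"))
wanted    <- c("gene_3", "gene_17")
rows      <- match(wanted, all_genes)

# Read only those rows; NULL selects all columns.
selected <- h5read(h5_path, "data/expression", index = list(rows, NULL))
rownames(selected) <- wanted
H5close()
```

Only the small vector of gene names is loaded in full; the expression matrix itself is read just for the requested rows.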
