# BMI 535/635: Management & Processing of Large-scale Data

#### Author: Michael Mooney (mooneymi@ohsu.edu)

## Week 3: Data Storage and Querying Solutions in R

1. Introduction
2. Learning Objectives
3. Resource Profiling
4. Review of R Data Types
5. Data from dbSNP
6. Connecting to Relational DBs
7. data.table
8. BigMemory
9. ff
10. HDF5

Requirements:

- R packages:
    - pryr
    - RMySQL
    - data.table
    - bigmemory
    - bigalgebra
    - ff
    - ffbase
    - rhdf5
    - bit64
    - profmem
    - parallel
- Data files:
    - dbSNP annotations (chromosome 1 only): `chr1_reducedCols.txt.gz` (download this from the state server)
    - A MySQL config file containing connection parameters: `~/.my.cnf`

In [1]:
library(pryr)
library(RMySQL)
library(data.table)
library(bigmemory)
library(bigalgebra)
library(ff)
library(ffbase)
library(rhdf5)
library(bit64)
library(profmem)
library(parallel)

Loading required package: DBI

Attaching package: ‘data.table’

The following object is masked from ‘package:pryr’:

    address

Loading required package: bigmemory.sri
Loading required package: bit
Attaching package bit
package:bit (c) 2008-2012 Jens Oehlschlaegel (GPL-2)
creators: bit bitwhich
coercion: as.logical as.integer as.bit as.bitwhich which
operator: ! & | xor != ==
querying: print length any all min max range sum summary
bit access: length<- [ [<- [[ [[<-
for more help type ?bit

Attaching package: ‘bit’

The following object is masked from ‘package:data.table’:

    setattr

The following object is masked from ‘package:base’:

    xor

Attaching package ff
- getOption("fftempdir")=="/var/folders/3r/wws_4jz54ms2t6m0jrz8k_6mg3ll58/T//RtmpsotR0Z"

- getOption("ffextension")=="ff"

- getOption("ffdrop")==TRUE

- getOption("fffinonexit")==TRUE

- getOption("ffpagesize")==65536

- getOption("ffcaching")=="mmnoflush"  -- consider "ffeachflush" if your system stalls on large writ

## Introduction

This is the **R** version of the previous lecture. We'll be addressing the same big-data issues, but this time exploring solutions offered in R. Just as a reminder, here are the problems often faced when working with large data sets:

1. Data does not fit into memory
    - In particular, this can be a problem when setting up parallel computations, where each process needs the full data
    - R can sometimes present unique challenges when it comes to memory usage. For more info see the following:
        - [http://adv-r.had.co.nz/memory.html](http://adv-r.had.co.nz/memory.html)
        - `?Memory`
2. Accessing (querying) the data is slow
3. Data files on-disk are very large (i.e. not easily portable)

Potential Solutions:

1. Use on-disk storage that is optimized for fast read/write access
2. Use data storage that allows for multiple concurrent reads (i.e. can be shared across multiple processes)
3. Use data compression

### Learning Objectives

1. You will learn some basic methods for profiling the amount of resources and time used by computational tasks
2. You will learn how to store large datasets in various "high-performance" R data structures
3. You will learn how to query data in each of the data structures
4. You will learn how to convert between these various data storage solutions


## Resource Profiling

`system.time` can be used to measure the runtime of a particular block of code. There are also a number of options for measuring memory usage (`mem_used` and `mem_change` from the `pryr` package, `profmem`, and `lineprof`). Examples are shown below.

More information can be found at Hadley Wickham's 'Advanced R' site: 

[http://adv-r.had.co.nz/Profiling.html](http://adv-r.had.co.nz/Profiling.html)

[http://adv-r.had.co.nz/memory.html#memory-profiling](http://adv-r.had.co.nz/memory.html#memory-profiling)

In [2]:
## Print the amount of memory R is currently using
mem_used()

36.1 MB

In [3]:
## A dummy function that simply creates a large vector, but returns nothing
foo = function(a, n=100) {
    Sys.sleep(2)
    b = rep(a, n)
    Sys.sleep(1)
    return(NULL)
}

## Use system.time to see how long code takes to run
system.time({foo(1,10000000)})

   user  system elapsed 
  0.096   0.022   3.121 

In [4]:
## See the change in memory after a piece of code is run
mem_change({foo(1,10000000)})

10.4 kB

In [5]:
mem_used()

36.4 MB

In [6]:
## profmem displays information about memory allocation for a block of code
profmem({foo(1,10000000)})

bytes,trace
80000040,foo
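
The numbers above can be reconciled with a little arithmetic. `profmem` logged an allocation of about 80 MB (10,000,000 doubles at 8 bytes each, plus a small header), while `mem_change` reported only a few kB, presumably because the vector `b` is released as soon as `foo()` returns. The following is a minimal base R sketch of the same idea using `object.size()`; the variable names are made up for this illustration, and a smaller `n` is used to keep it quick.

```r
## Assumption: 'n' and 'b' are illustrative names, not part of the lecture code
n <- 1e6

## A numeric (double) vector needs roughly 8 bytes per element
b <- rep(1, n)
size_bytes <- as.numeric(object.size(b))
print(size_bytes)

## Once the object is removed, the memory can be reclaimed by the garbage collector
rm(b)
invisible(gc())
```

`object.size()` reports the size of a single object, whereas `mem_used()` and `mem_change()` report on the whole R session.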


## Review of Basic R Data Types

Basic R data types and when to use them:

**Vectors**: Vectors store collections of data elements of a single type. R will perform automatic type conversions, so be careful and pay attention to your data types. Vectors can be named, so you can access elements by name or by index. Note: set operations can be performed on vectors.

**Lists**: An R list is similar to a Python dictionary because it is a labeled collection of items (you can find an item based on a key). However, R lists are not stored in a way that makes fast lookups possible. If you have a large collection of data and need to repeatedly search for specific items, use an environment instead.

**Environments**: An environment can be created and accessed very much like a list, but because of the way it is stored internally (environments are hashed by default), data access is much faster.

**DataFrames**: A table data structure (the inspiration for the Pandas DataFrame in Python), which can hold columns of different data types.
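
As a quick illustration of the vector behavior described above, here is a small base R sketch (all object names are invented for this example) showing automatic type conversion, access by name or index, and a set operation.

```r
## Mixing types in one vector silently converts everything to a common type
mixed <- c(1, "a", TRUE)
class(mixed)            # "character"

## Integers and doubles combine to a double ("numeric") vector
nums <- c(1L, 2.5)
class(nums)             # "numeric"

## Named vectors can be accessed by name or by index
positions <- c(rs171 = 175261679, rs242 = 20869461)
positions[["rs242"]]
positions[[1]]

## Set operations work directly on vectors
shared <- intersect(c(1, 2, 3), c(2, 3, 4))
shared
```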


In [7]:
## Create some example data
VECTOR1 = sample(c(1:1000000), 1000000)
VECTOR2 = c(1:1000000)
LIST1 = as.list(VECTOR2)
names(LIST1) = as.character(VECTOR1)
DF1 = data.frame(A=VECTOR1, B=VECTOR2)
rownames(DF1) = as.character(VECTOR1)

In [8]:
mem_used()

200 MB

In [9]:
## How long does it take to find an item?
## Using a vector
t = system.time({idx = match(567890, VECTOR1)})
print("Vector:")
print(idx)
print(t)

## Using a vector version 2
t = system.time({idx = which(VECTOR1 == 567890)})
print("Vector #2:")
print(idx)
print(t)

## Using a list
t = system.time({idx = LIST1[[as.character(567890)]]})
print("List:")
print(idx)
print(t)

## Using a dataframe
t = system.time({idx = match(567890, DF1$A)})
print("DataFrame:")
print(idx)
print(t)

## Using a dataframe version 2
t = system.time({idx = DF1$B[DF1$A == 567890]})
print("DataFrame #2:")
print(idx)
print(t)

## Using a dataframe version 3 (rownames)
t = system.time({idx = DF1[as.character(567890), 'B']})
print("DataFrame #3:")
print(idx)
print(t)

[1] "Vector:"
[1] 297206
   user  system elapsed 
  0.006   0.000   0.007 
[1] "Vector #2:"
[1] 297206
   user  system elapsed 
  0.014   0.001   0.015 
[1] "List:"
[1] 297206
   user  system elapsed 
  0.005   0.000   0.005 
[1] "DataFrame:"
[1] 297206
   user  system elapsed 
  0.005   0.001   0.006 
[1] "DataFrame #2:"
[1] 297206
   user  system elapsed 
  0.014   0.001   0.015 
[1] "DataFrame #3:"
[1] 297206
   user  system elapsed 
  0.021   0.001   0.021 


In [10]:
## How long does it take to determine if an item exists?
x = 567890
## Using a vector
t = system.time({test = x %in% VECTOR1})
print("Vector:")
print(test)
print(t)

## Using a list
t = system.time({test = as.character(x) %in% names(LIST1)})
print("List:")
print(test)
print(t)

## Using a dataframe
t = system.time({test = x %in% DF1$A})
print("DataFrame:")
print(test)
print(t)

## Using a dataframe version 2
t = system.time({test = x %in% rownames(DF1)})
print("DataFrame #2:")
print(test)
print(t)

[1] "Vector:"
[1] TRUE
   user  system elapsed 
  0.006   0.001   0.007 
[1] "List:"
[1] TRUE
   user  system elapsed 
  0.007   0.001   0.006 
[1] "DataFrame:"
[1] TRUE
   user  system elapsed 
  0.006   0.000   0.006 
[1] "DataFrame #2:"
[1] TRUE
   user  system elapsed 
  0.007   0.001   0.008 


In [11]:
## Now let's compare a list to an environment
ENV1 = as.environment(LIST1)

In [12]:
mem_used()

378 MB

In [13]:
## How long does it take to find an item?
## Using a list
t = system.time({idx = LIST1[[as.character(567890)]]})
print("List:")
print(idx)
print(t)

## Using an environment
t = system.time({idx = ENV1[[as.character(567890)]]})
print("Environment:")
print(idx)
print(t)

[1] "List:"
[1] 297206
   user  system elapsed 
  0.006   0.001   0.006 
[1] "Environment:"
[1] 297206
   user  system elapsed 
  0.000   0.000   0.001 


In [14]:
## How long does it take to determine if an item exists?
x = 567890
## Using a list
t = system.time({test = as.character(x) %in% names(LIST1)})
print("List:")
print(test)
print(t)

## Using an environment
t = system.time({test = exists(as.character(x), where=ENV1)})
print("Environment:")
print(test)
print(t)

[1] "List:"
[1] TRUE
   user  system elapsed 
  0.006   0.001   0.007 
[1] "Environment:"
[1] TRUE
   user  system elapsed 
      0       0       0 
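
The environment-as-lookup-table pattern can be sketched on a tiny example. This is an illustrative sketch only: the keys and gene names are taken from the dbSNP rows shown in this notebook, but the object names (`lookup`, `has_key`, and so on) are made up. Setting `inherits = FALSE` restricts `exists()` to the environment itself, so a name defined in a parent environment is not reported as a hit.

```r
## Build a small hashed environment from a named list
keys <- as.character(c(538, 546, 549))
values <- list("KCNAB2", "TMED5", "TMEM51")
lookup <- list2env(setNames(values, keys),
                   envir = new.env(hash = TRUE, parent = emptyenv()))

## Test for a key without searching parent environments
has_key <- exists("546", envir = lookup, inherits = FALSE)
no_key <- exists("999", envir = lookup, inherits = FALSE)

## Retrieve a value; a missing key returns NULL rather than an error
gene <- lookup[["546"]]
missing_gene <- lookup[["999"]]
```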


## dbSNP Dataset

For the following examples, we'll be using data from dbSNP: a file with information about all single nucleotide polymorphisms (SNPs) on human chromosome 1. It is a tab-delimited text file containing four columns: the 'rs' number of the SNP, the chromosome, the position, and a comma-separated list of genes at the same location. Note: the file begins with a multi-line header (seven lines, including blank lines), which has to be skipped when loading the data.

In [15]:
print(system("head ./xdata/chr1_reducedCols.txt", intern=TRUE))

 [1] "dbSNP Chromosome Report"                                                                 
 [2] "Refer to ftp://ftp.ncbi.nlm.nih.gov/snp/00readme for documentation on tabular data below"
 [3] ""                                                                                        
 [4] "rs#\tchr\tchr\tlocal"                                                                    
 [5] "\t\tpos\tloci"                                                                           
 [6] ""                                                                                        
 [7] ""                                                                                        
 [8] "171\t1\t175261679\t"                                                                     
 [9] "242\t1\t20869461\t"                                                                      
[10] "538\t1\t6160958\tKCNAB2"                                                                 
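
To make the header handling concrete, here is a minimal base R sketch that parses the ten lines printed above from an in-memory string instead of the real file. The object names (`header_and_rows`, `snps_small`) are invented for this example; the `skip`, `col.names`, and `na.strings` settings mirror the ones used later in this notebook.

```r
## The first ten lines of the file, as printed above
header_and_rows <- c(
    "dbSNP Chromosome Report",
    "Refer to ftp://ftp.ncbi.nlm.nih.gov/snp/00readme for documentation on tabular data below",
    "",
    "rs#\tchr\tchr\tlocal",
    "\t\tpos\tloci",
    "",
    "",
    "171\t1\t175261679\t",
    "242\t1\t20869461\t",
    "538\t1\t6160958\tKCNAB2"
)

## Skip the seven header lines and name the columns ourselves
snps_small <- read.delim(text = paste(header_and_rows, collapse = "\n"),
                         header = FALSE, skip = 7, sep = "\t",
                         col.names = c("rs", "chr", "pos", "loci"),
                         as.is = TRUE, na.strings = c("NA", "", " "))
snps_small
sapply(snps_small, class)
```

Empty `loci` fields are read as `NA` because the empty string is listed in `na.strings`.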


## Connecting to Relational DBs in R

We'll be connecting to the same DB as last time. The R package `RMySQL` will connect to the database using connection settings stored in a configuration file in your home directory (`~/.my.cnf`). The file should contain 'groups' of settings for the databases that you connect to frequently. For example, the following entry in the configuration file allows you to connect to a database called 'bmi535_snps' (the square brackets indicate a 'group', and the same file can hold several groups).

    [bmi535_snps]
    host=localhost
    user=mooneymi
    password=mypassword
    database=bmi535_snps

In [16]:
## Connect to the MySQL database using connection settings defined in ~/.my.cnf
conn = dbConnect(RMySQL::MySQL(), group="bmi535_snps")

In [17]:
## Let's query the DB
system.time({query = "SELECT * FROM snps WHERE chr = 1 AND pos = 225512846 AND loci = 'DNAH14';"
res = dbSendQuery(conn, query)
rows = dbFetch(res)})

   user  system elapsed 
  0.004   0.000   5.398 

In [18]:
rows
dbClearResult(res)

rs,chr,pos,loci
189425743,1,225512846,DNAH14


In [19]:
## Now let's query the DB using the indexed table
system.time({query = "SELECT * FROM snps_idx WHERE chr = 1 AND pos = 225512846 AND loci = 'DNAH14';"
res = dbSendQuery(conn, query)
rows = dbFetch(res)})

   user  system elapsed 
  0.002   0.000   0.014 

In [20]:
rows
dbClearResult(res)

rs,chr,pos,loci
189425743,1,225512846,DNAH14


In [21]:
## Disconnect from the DB
dbDisconnect(conn)
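
The queries above are written as literal strings. When the values come from R variables, one option is to let `DBI` do the quoting. The sketch below uses `sqlInterpolate()` with the generic `ANSI()` connection so that it runs without a database; with a live connection you would pass `conn` instead. The variable names are illustrative, and this is only one possible approach, not necessarily the one used in the course.

```r
library(DBI)

## Values that might come from user input or another data structure
target_chr <- 1L
target_pos <- 225512846L
target_loci <- "DNAH14"

## Placeholders (?name) are replaced with safely quoted values
query <- sqlInterpolate(
    ANSI(),
    "SELECT * FROM snps_idx WHERE chr = ?chr AND pos = ?pos AND loci = ?loci;",
    chr = target_chr, pos = target_pos, loci = target_loci
)
query
```

The resulting string can then be passed to `dbSendQuery()` or `dbGetQuery()` in the usual way.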

## `data.table`

The `data.table` package implements what is essentially an optimized dataframe. 

In [22]:
## Let's start by loading the data into a standard R dataframe
## Note we can load directly from a compressed file (gzip)
## This takes a few minutes
mem_used()
system.time({snps = read.delim('./xdata/chr1_reducedCols.txt.gz', header=F, skip=7, sep='\t', 
                               col.names=c('rs', 'chr', 'pos', 'loci'), as.is=T, na.strings=c('NA', '', ' '))})
mem_used()

378 MB

   user  system elapsed 
126.335   1.134 127.571 

753 MB

In [23]:
## Get the size of the dataframe
dim(snps)

In [24]:
## View the first few rows
head(snps)

rs,chr,pos,loci
171,1,175261679,
242,1,20869461,
538,1,6160958,KCNAB2
546,1,93617546,TMED5
549,1,15546825,TMEM51
568,1,203713133,ATP2B4


In [25]:
## Let's look at the data types for each column
sapply(snps, class)

**It's important to note here that (IMO) R does a better job with missing values and data types than Pandas. For example, missing values are allowed in both character and numeric columns, and don't require special treatment. Of course, we saw that the data cleaning in Pandas was fairly easy if you know what to look for.**

In [26]:
## Search the dataframe for a specific row
## Note: here we wrap the condition inside which() to exclude rows with NAs
system.time({row = snps[which(with(snps, chr==1 & pos==225512846 & loci=='DNAH14')), ]})
row

   user  system elapsed 
  0.743   0.072   0.817 

Unnamed: 0,rs,chr,pos,loci
3456789,189425743,1,225512846,DNAH14


In [27]:
## Don't make the query more complicated than it needs to be
system.time({row = snps[which(with(snps, pos==225512846)), ]})
row

   user  system elapsed 
  0.212   0.013   0.225 

Unnamed: 0,rs,chr,pos,loci
3456789,189425743,1,225512846,DNAH14


### Load Data into a `data.table`

In [28]:
## Load SNP data into data.table
mem_change({snps_dt = as.data.table(snps)})

245 MB

In [29]:
## Query the data.table
system.time({row = snps_dt[chr==1 & pos==225512846 & loci=='DNAH14',]})
row

   user  system elapsed 
  0.688   0.040   0.731 

rs,chr,pos,loci
189425743,1,225512846,DNAH14


In [30]:
## Add a key to the data.table
setkey(snps_dt, pos)

In [31]:
## Query the data.table with key
system.time({row = snps_dt[chr==1 & pos==225512846 & loci=='DNAH14']})
row

   user  system elapsed 
  0.578   0.041   0.620 

rs,chr,pos,loci
189425743,1,225512846,DNAH14


In [32]:
## Matching on only the key improves performance significantly over 
## a regular dataframe
system.time({row = snps_dt[pos==225512846]})
row

   user  system elapsed 
  0.014   0.001   0.016 

rs,chr,pos,loci
189425743,1,225512846,DNAH14
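
A keyed `data.table` can also be queried with join-style syntax, which makes the use of the key explicit. The following is a small sketch on a made-up three-row table (`toy` is not part of the lecture data); `.()` is `data.table`'s alias for `list()`, and `nomatch = 0L` drops keys that are not found instead of returning a row of `NA`s.

```r
library(data.table)

## A toy table with the same columns as the SNP data
toy <- data.table(rs = c(171L, 242L, 538L),
                  chr = 1L,
                  pos = c(175261679L, 20869461L, 6160958L),
                  loci = c(NA, NA, "KCNAB2"))

## Setting the key sorts the table by 'pos' and enables binary search
setkey(toy, pos)
key(toy)

## Look up a row by its key value
hit <- toy[.(6160958L)]
hit

## A key that is absent returns zero rows when nomatch = 0L
miss <- toy[.(999L), nomatch = 0L]
nrow(miss)
```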


## BigMemory

The `bigmemory` package allows for storing large datasets in shared-memory and file-backed data structures. This allows for large data structures to be shared across multiple R processes to facilitate efficient parallel processing. 

One caveat is that `bigmemory` creates matrices, which will handle only a single data type, unlike dataframes. By default, character columns in dataframes will be converted to factors and factors converted to numeric levels. The `ff` package discussed below may be a better solution if you must have multiple data types in the same object.
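
The coercion described above can be seen on a toy data frame with base R alone. This is a small illustrative sketch (the `toy` data frame is made up for this example): `as.matrix()` on mixed columns falls back to a character matrix, while `data.matrix()` converts character columns to factors and then to their integer level codes.

```r
## A data frame with one integer and one character column
toy <- data.frame(rs = c(171L, 242L, 538L),
                  loci = c("TMED5", "KCNAB2", "TMED5"),
                  stringsAsFactors = FALSE)

## as.matrix() must pick a single type, so everything becomes character
char_mat <- as.matrix(toy)
storage.mode(char_mat)

## data.matrix() keeps the matrix numeric by replacing strings with level codes
num_mat <- data.matrix(toy)
storage.mode(num_mat)
num_mat[, "loci"]
```

Note that the level codes carry no meaning on their own; the mapping back to the original strings has to be kept separately.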

In [33]:
## Let's create a file-backed bigmatrix object using only 
## the first 3 columns of the snps dataframe
## First check that the file doesn't exist
if (file.exists("./xdata/snps_bigmem.bin")) {
    file.remove("./xdata/snps_bigmem.bin")
    file.remove("./xdata/snps_bigmem.bin.desc")
}

In [34]:
## Create a bigmatrix object with the first three columns of the SNPs dataframe (all integers)
## Note: we are first converting the dataframe to a numeric matrix
mem_change({snps_bm = as.big.matrix(as.matrix(snps[,1:3]), type="integer", backingfile="snps_bigmem.bin", backingpath="./xdata")})

“No descriptor file given, it will be named snps_bigmem.bin.desc”

115 kB

In [35]:
## How much space is used on disk
print(system("ls -lt ./xdata/snps_bigmem*", intern=TRUE))

[1] "-rw-r--r--  1 mooneymi  OHSUM01\\Domain Users  146855317 Jan 18 15:29 ./xdata/snps_bigmem.bin"     
[2] "-rw-r--r--  1 mooneymi  OHSUM01\\Domain Users        494 Jan 18 15:29 ./xdata/snps_bigmem.bin.desc"


In [36]:
head(snps_bm)

rs,chr,pos
171,1,175261679
242,1,20869461
538,1,6160958
546,1,93617546
549,1,15546825
568,1,203713133


In [37]:
## Use the mwhich() function to query the bigmatrix object
system.time({row = snps_bm[mwhich(snps_bm, c('chr','pos'), c(1, 225512846), c('eq', 'eq')), ]})
row

   user  system elapsed 
  1.469   0.001   1.471 

The performance for searching is pretty poor, but keep in mind that `bigmemory` was designed with numeric matrices in mind, not tables of heterogeneous data. 

In [38]:
## Data access is pretty fast for specific data elements
## i.e. selecting a specific index
system.time({z = snps_bm[1200:1220,]})

   user  system elapsed 
      0       0       0 

In [39]:
z

rs,chr,pos
12361,1,224564377.0
12371,1,180163390.0
12375,1,10596341.0
12384,1,32256166.0
12386,1,36068863.0
12395,1,38268836.0
12419,1,44686322.0
12439,1,25169634.0
12442,1,43829177.0
12455,1,55533917.0


### `bigalgebra` 

The `bigalgebra` package allows efficient linear algebra operations on `bigmemory` matrices.

In [40]:
if (file.exists("./xdata/bigmem.bin")) {
    file.remove("./xdata/bigmem.bin")
    file.remove("./xdata/bigmem.bin.desc")
}
## Let's create another on-disk bigmatrix
bm = as.big.matrix(matrix(runif(1000000), 10000, 100), type="double", backingfile="bigmem.bin", backingpath="./xdata")

“No descriptor file given, it will be named bigmem.bin.desc”

In [41]:
## Create another bigmatrix object with just the first row
v = as.big.matrix(bm[1,])

“Coercing vector to a single-column matrix.”

In [42]:
## Dimensions of the matrix
dim(bm)

In [43]:
## Dimensions of the vector
dim(v)

In [44]:
system.time({x = bm %*% v})

   user  system elapsed 
  0.014   0.004   0.019 

In [45]:
dim(x)

## ff

Similar to `bigmemory`, the `ff` package allows for on-disk storage of large datasets with efficient data access and the ability to share the same data structure across multiple R processes.

In [46]:
getOption("fftempdir")

In [47]:
if (!dir.exists("./xdata/ff")) {
    dir.create("./xdata/ff")
} else {
    system("rm ./xdata/ff/*")
}

In [48]:
## Let's create a ffdf object
## Again, you can load data from a compressed file
## If you don't set the asffdf_args, the files will be created
## in the fftempdir (see above)
mem_change({snps_ff = read.delim.ffdf(file='./xdata/chr1_reducedCols.txt.gz', header=F, skip=7, sep='\t', 
                                      asffdf_args=list(col_args=list(pattern = "./xdata/ff/snps_ff")))})

1.39 MB

In [49]:
## Set column names
colnames(snps_ff) = c('rs','chr','pos','loci')

In [50]:
head(snps_ff)

ffdf (all open) dim=c(12237943,4), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
     PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix
rs             V1      integer       integer FALSE           FALSE
chr            V2      integer       integer FALSE           FALSE
pos            V3      integer       integer FALSE           FALSE
loci           V4      integer       integer FALSE           FALSE
     PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol
rs              FALSE                 1                1               1
chr             FALSE                 2                1               1
pos             FALSE                 3                1               1
loci            FALSE                 4                1               1
     PhysicalIsOpen
rs             TRUE
chr            TRUE
pos            TRUE
loci           TRUE
ffdf data
                         rs                chr                pos
1              171          1        

In [51]:
## Use the ffwhich() function to query the data
system.time({row = snps_ff[ffwhich(snps_ff, chr==1 & pos==225512846 & loci=='DNAH14'), ]})
row

   user  system elapsed 
  2.308   0.312   2.929 

ffdf (all open) dim=c(1,4), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
     PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix
rs             V1      integer       integer FALSE           FALSE
chr            V2      integer       integer FALSE           FALSE
pos            V3      integer       integer FALSE           FALSE
loci           V4      integer       integer FALSE           FALSE
     PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol
rs              FALSE                 1                1               1
chr             FALSE                 2                1               1
pos             FALSE                 3                1               1
loci            FALSE                 4                1               1
     PhysicalIsOpen
rs             TRUE
chr            TRUE
pos            TRUE
loci           TRUE
ffdf data
         rs       chr       pos      loci
1 189425743 1         225512846 DNAH14   

In [52]:
## To save the ffdf object in an archive to load later
ffsave(list=c('snps_ff'), file='./xdata/ff/snps_ff_archive')

In [53]:
is.open(snps_ff)

In [54]:
## To delete the files and remove the ffdf object do the following
## Note: delete seems a bit unstable (sometimes files remain)
close(snps_ff)
delete(snps_ff)

## You may also want to do the following to make sure files are deleted
if (length(list.files("./xdata/ff", ".*\\.ff")) > 0) {
    system("rm ./xdata/ff/*.ff")
}

In [55]:
## Remove the ffdf object 
rm(snps_ff)

In [56]:
## Load the archive again
ffload(file='./xdata/ff/snps_ff_archive')

In [57]:
open(snps_ff)
head(snps_ff)

ffdf (all open) dim=c(12237943,4), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
     PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix
rs             V1      integer       integer FALSE           FALSE
chr            V2      integer       integer FALSE           FALSE
pos            V3      integer       integer FALSE           FALSE
loci           V4      integer       integer FALSE           FALSE
     PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol
rs              FALSE                 1                1               1
chr             FALSE                 2                1               1
pos             FALSE                 3                1               1
loci            FALSE                 4                1               1
     PhysicalIsOpen
rs             TRUE
chr            TRUE
pos            TRUE
loci           TRUE
ffdf data
                         rs                chr                pos
1              171          1        

In [58]:
## To delete the files and remove the ffdf object do the following
## Note: delete seems a bit unstable (sometimes files remain)
close(snps_ff)
delete(snps_ff)

## You may also want to do the following to make sure files are deleted
if (length(list.files("./xdata/ff", ".*\\.ff")) > 0) {
    system("rm ./xdata/ff/*.ff")
}

In [59]:
## Remove the archive
ffdrop(file='./xdata/ff/snps_ff_archive')

## HDF5

The `rhdf5` package provides an interface between R and the HDF5 libraries, much like PyTables in Python. However, `rhdf5` has fairly limited functionality, so it is not as useful for querying heterogeneous data sets. But it can be useful for storing large datasets and accessing/processing chunks of that data.

Unfortunately, `rhdf5` does not currently support subsetting compound type HDF5 datasets (multiple data types), so the only option is to read the entire dataset into R. The `h5read()` function will load data from the file into an R array. (We'll show how to subset homogeneous data structures below.)

In [60]:
## Let's use the rhdf5 package to look at our previously saved HDF5 file
h5ls('./xdata/snps_pandas_hdf_zlib.h5')

Unnamed: 0,group,name,otype,dclass,dim
0,/,snps,H5I_GROUP,,
1,/snps,table,H5I_DATASET,COMPOUND,12237943.0


In [61]:
## Load data from the HDF5 file into an R array
## Note: Pandas uses 64-bit integers, which are not available in base R
snps_hdf5 = h5read('./xdata/snps_pandas_hdf_zlib.h5', 'snps/table', bit64conversion='bit64')

In [62]:
dim(snps_hdf5)

In [63]:
head(snps_hdf5)

index,rs,chr,pos,loci
0,171,1,175261679,
1,242,1,20869461,
2,538,1,6160958,KCNAB2
3,546,1,93617546,TMED5
4,549,1,15546825,TMEM51
5,568,1,203713133,ATP2B4


In [64]:
sapply(snps_hdf5, class)

In [65]:
## Do some data type conversion (necessary for rhdf5 compatibility)
snps_hdf5$index = as.numeric(snps_hdf5$index)
snps_hdf5$rs = as.numeric(snps_hdf5$rs)
snps_hdf5$chr = as.numeric(snps_hdf5$chr)
snps_hdf5$pos = as.numeric(snps_hdf5$pos)

In [66]:
sapply(snps_hdf5, class)

In [67]:
head(snps_hdf5)

index,rs,chr,pos,loci
0,171,1,175261679,
1,242,1,20869461,
2,538,1,6160958,KCNAB2
3,546,1,93617546,TMED5
4,549,1,15546825,TMEM51
5,568,1,203713133,ATP2B4


In [68]:
## Close the file
H5close()

To show different ways to access an HDF5 file that contains a matrix of a single data type, we'll first create a new HDF5 file using `rhdf5`.

In [69]:
## First convert the dataframe to a numeric matrix (just the rs, chr, and pos columns)
m = do.call(cbind, snps_hdf5[,2:4])

In [70]:
## Get the class
class(m)

In [71]:
## And storage mode (data type) of the matrix
storage.mode(m)

In [72]:
## Create a new HDF5 file
if (file.exists("./xdata/snps_rhdf5.h5")) {
    file.remove("./xdata/snps_rhdf5.h5")
}
h5createFile("./xdata/snps_rhdf5.h5")

In [73]:
## Let's create the same file structure as before, 
## but we'll include only the first three numeric columns
h5createGroup("./xdata/snps_rhdf5.h5","snps")
h5ls("./xdata/snps_rhdf5.h5")

Unnamed: 0,group,name,otype,dclass,dim
0,/,snps,H5I_GROUP,,


In [74]:
## First we create a dataset in the HDF5 file
## with the correct dimensions
## We also specify the chunk size and compression level (default=6)
h5createDataset(file="./xdata/snps_rhdf5.h5", dataset="snps/table", 
                dims=dim(snps_hdf5[,2:4]), chunk=c(1000,3), level=6)

In [75]:
## We convert the dataframe to a matrix to ensure 
## it is stored as a numeric matrix
h5write(m, "./xdata/snps_rhdf5.h5", "snps/table")
h5ls("./xdata/snps_rhdf5.h5")

Unnamed: 0,group,name,otype,dclass,dim
0,/,snps,H5I_GROUP,,
1,/snps,table,H5I_DATASET,FLOAT,12237943 x 3


In [76]:
H5close()

Let's read just a subset of the data. Here we'll use the `h5read()` function and specify indices:

In [77]:
snps_rhdf5 = h5read('./xdata/snps_rhdf5.h5', 'snps/table', index=list(1:10,1:3))
snps_rhdf5

0,1,2
171,1,175261679
242,1,20869461
538,1,6160958
546,1,93617546
549,1,15546825
568,1,203713133
665,1,24181041
672,1,53679329
677,1,173876561
685,1,161191522


In [78]:
H5close()
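
Reading by index makes it possible to process a dataset block by block without ever loading all of it. Below is a minimal sketch of that pattern on a small temporary file; the file, the dataset name `m`, and the block size are all invented for this example. It computes column sums one block of rows at a time and assumes the dimensions are already known (in practice they could be taken from `h5ls()`).

```r
library(rhdf5)

## Write a small numeric matrix to a temporary HDF5 file
h5_path <- tempfile(fileext = ".h5")
m <- matrix(as.numeric(1:30), nrow = 10, ncol = 3)
h5createFile(h5_path)
h5createDataset(h5_path, "m", dims = dim(m), chunk = c(5, 3), level = 6)
h5write(m, h5_path, "m")

## Accumulate column sums over blocks of rows
block_size <- 4
n_rows <- nrow(m)
col_totals <- numeric(ncol(m))
for (start in seq(1, n_rows, by = block_size)) {
    rows <- start:min(start + block_size - 1, n_rows)
    ## NULL in the index list selects every element along that dimension
    block <- h5read(h5_path, "m", index = list(rows, NULL))
    col_totals <- col_totals + colSums(block)
}
H5close()
col_totals
```

Only `block_size` rows are held in memory at any time, so the same loop works for datasets that are much larger than RAM.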

You can also access HDF5 datasets within a file using file and dataset handles. A file handle is returned by `H5Fopen()`, and the `&` operator allows you to access dataset handles. The `$` operator will give you access to the data itself (similar to accessing named elements in a list).

In [79]:
## Open the file and return a file handle
snps_rhdf5_fh = H5Fopen('./xdata/snps_rhdf5.h5')
snps_rhdf5_fh

HDF5 FILE
        name /
    filename 

  name     otype dclass dim
0 snps H5I_GROUP           

In [80]:
## Get a dataset handle by specifying the path to the dataset within the file
snps_table = snps_rhdf5_fh&'snps/table'
snps_table

HDF5 DATASET
        name /snps/table
    filename 
        type H5T_IEEE_F64LE
        rank 2
        size 12237943 x 3
     maxsize 12237943 x 3

In [81]:
## Access the table under the 'snps' group
snps_rhdf5_fh$'snps/table'[1:5,]

0,1,2
171,1,175261679
242,1,20869461
538,1,6160958
546,1,93617546
549,1,15546825


In [82]:
H5close()

## In-Class Exercises

In [83]:
## Exercise 1.
## Use parallel processes to calculate the column sums 
## of the first ten columns of a file-backed bigmatrix 
## (you can use 'bm' defined above).
## Use describe() and attach.big.matrix() from the bigmemory package.
## An easy option for parallel R processes is mclapply() from the
## parallel R package



In [84]:
## Here's an example of sequentially calling foo() 10 times
system.time({lapply(1:10, function(x){foo(x, 1000000)})})

   user  system elapsed 
  0.051   0.015  30.116 

In [85]:
## Here's an example of using mclapply() to call foo() in parallel, using 4 cores
system.time({mclapply(1:10, function(x){foo(x, 1000000)}, mc.cores=4)})

   user  system elapsed 
  0.072   0.098   9.121 
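
The two timings above are consistent with simple arithmetic: each call to `foo()` sleeps for about 3 seconds, so 10 sequential calls take about 30 seconds. The sketch below estimates the parallel time under the assumption that `mclapply()` (with its default prescheduling) deals the tasks out to the cores in round-robin fashion and that overhead is negligible. The variable names are made up for this illustration.

```r
n_tasks <- 10
n_cores <- 4
secs_per_task <- 3   # foo() sleeps for 2 + 1 seconds

## Sequential: every task runs one after another
sequential_secs <- n_tasks * secs_per_task

## Parallel: tasks are dealt out to cores 1, 2, 3, 4, 1, 2, ...
tasks_per_core <- table(rep_len(seq_len(n_cores), n_tasks))
tasks_per_core

## The busiest core determines the total elapsed time
parallel_secs <- max(tasks_per_core) * secs_per_task

c(sequential = sequential_secs, parallel = parallel_secs)
```

With 10 tasks on 4 cores the busiest core handles 3 tasks, so the speedup is closer to 3.3x than to 4x.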

#### Last Updated: 16-Jan-2018