# BMI 535/635: Management & Processing of Large-scale Data

#### Author: Michael Mooney (mooneymi@ohsu.edu)

## Data Storage and Querying Solutions in R

1. Introduction
2. Learning Objectives
3. Resource Profiling
4. Review of R Data Types
5. Data from dbSNP
6. Connecting to Relational DBs
7. data.table
8. BigMemory
9. HDF5

Requirements:

- R packages:
    - pryr
    - RMySQL
    - data.table
    - bigmemory
    - bigalgebra
    - hdf5r
    - bit64
    - profmem
    - parallel
- Data files:
    - dbSNP annotations (chromosome 1 only): `chr1_reducedCols.txt.gz` (download this from the state server)
    - A MySQL config file containing connection parameters: `~/.my.cnf`

In [1]:
library(pryr)
library(RMySQL)
library(data.table)
library(bigmemory)
library(bigalgebra)
library(hdf5r)
library(bit64)
library(profmem)
library(parallel)

Registered S3 method overwritten by 'pryr':
  method      from
  print.bytes Rcpp
Loading required package: DBI

Attaching package: ‘data.table’

The following object is masked from ‘package:pryr’:

    address

Loading required package: bit
Attaching package bit
package:bit (c) 2008-2012 Jens Oehlschlaegel (GPL-2)
creators: bit bitwhich
coercion: as.logical as.integer as.bit as.bitwhich which
operator: ! & | xor != ==
querying: print length any all min max range sum summary
bit access: length<- [ [<- [[ [[<-
for more help type ?bit

Attaching package: ‘bit’

The following object is masked from ‘package:data.table’:

    setattr

The following object is masked from ‘package:base’:

    xor

Attaching package bit64
package:bit64 (c) 2011-2012 Jens Oehlschlaegel
creators: integer64 seq :
coercion: as.integer64 as.vector as.logical as.integer as.double as.character as.bin
logical operator: ! & | xor != == < <= >= >
arithmetic operator: + - * / %/% %% ^
math: sign abs sqrt log log2 log10

## Introduction

This is the **R** version of the previous lecture. We'll be addressing the same big-data issues, but this time exploring solutions offered in R. Just as a reminder, here are the problems often faced when working with large data sets:

1. Data does not fit into memory
    - In particular, this can be a problem when setting up parallel computations, where each process needs the full data
    - R can sometimes present unique challenges when it comes to memory usage. For more info see the following:
        - [http://adv-r.had.co.nz/memory.html](http://adv-r.had.co.nz/memory.html)
        - `?Memory`
2. Accessing (querying) the data is slow
3. Data files on-disk are very large (i.e. not easily portable)

Potential Solutions:

1. Use on-disk storage that is optimized for fast read/write access
2. Use data storage that allows for multiple concurrent reads (i.e. can be shared across multiple processes)
3. Use data compression

## Learning Objectives

1. You will learn some basic methods for profiling the amount of resources and time used by computational tasks
2. You will learn how to store large datasets in various "high-performance" R data structures
3. You will learn how to query data in each of the data structures
4. You will learn how to convert between these various data storage solutions


## Resource Profiling

`system.time` can be used to measure the runtime of a particular block of code. There are also a number of options for measuring memory usage (`mem_used` and `mem_change` from the `pryr` package, and `profmem` from the `profmem` package). Examples are shown below.

More information can be found at Hadley Wickham's 'Advanced R' site: 

[http://adv-r.had.co.nz/Profiling.html](http://adv-r.had.co.nz/Profiling.html)

[http://adv-r.had.co.nz/memory.html#memory-profiling](http://adv-r.had.co.nz/memory.html#memory-profiling)

In [2]:
## Print the amount of memory R is currently using
mem_used()

61.8 MB

In [3]:
## A dummy function that simply creates a large vector, but returns nothing
foo = function(a, n=100) {
    Sys.sleep(2)
    b = rep(a, n)
    Sys.sleep(1)
    return(NULL)
}

## Use system.time to see how long code takes to run
system.time({foo(1,10000000)})

   user  system elapsed 
  0.090   0.017   3.115 

In [4]:
## See the change in memory after a piece of code is run
mem_change({foo(1,10000000)})

8.48 kB

In [5]:
mem_used()

62.1 MB

In [6]:
## profmem displays information about memory allocation for a block of code
profmem({foo(1,10000000)})

what,bytes,trace
alloc,80000048,foo
alloc,528,
alloc,1648,
alloc,1648,
alloc,1072,
alloc,256,
alloc,456,
alloc,216,
alloc,256,
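
The first `alloc` line above can be checked by hand: `rep(1, 10000000)` creates a double vector, which needs 8 bytes per element plus a small header. The sketch below uses only base R (`object.size`). The 48-byte header is what a typical 64-bit R build reports, so treat that figure as an assumption about your platform.

```r
## Number of elements created by rep(1, 10000000) inside foo()
n = 10000000

## Doubles take 8 bytes each
payload_bytes = 8 * n
print(payload_bytes)

## object.size() reports the payload plus the vector header
vec_bytes = as.numeric(object.size(numeric(n)))
print(vec_bytes)

## Header overhead (48 bytes on a typical 64-bit build)
print(vec_bytes - payload_bytes)
```

This also suggests why `mem_change` reported only a few kB for the same call: `b` is local to `foo`, so the 80 MB can be garbage-collected once the function returns.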


## Review of Basic R Data Types

Basic R data types and when to use them:

**Vectors**: Vectors store collections of data elements of a single type. R will perform automatic type conversions, so be careful and pay attention to your data types. Vectors can be named, so you can access elements by name or by index. Note: set operations can be performed on vectors.

**Lists**: An R list is similar to a Python dictionary because it is a labeled collection of items (you can find an item based on a key). However, R lists are not stored in a way that makes fast lookups possible. If you have a large collection of data and need to repeatedly search for specific items, use an environment instead.

**Environments**: An environment can be created and accessed very much like a list, but because of the way it is stored internally, data access is much faster.

**DataFrames**: A table data structure (the inspiration for the Pandas DataFrame in Python), which can hold columns of different data types.
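
The behaviours described above can be tried on a few toy values before moving to the larger timing examples. This is a minimal sketch using base R only; the object names (`v`, `small_list`, `small_env`) are made up for illustration, and the environment is explicitly created with a hash table.

```r
## Vectors hold a single type, so mixing types triggers silent coercion
v = c(1, "a", TRUE)
v_class = class(v)
print(v_class)

## Set operations on vectors
u = union(c(1, 2, 3), c(3, 4))
i = intersect(c(1, 2, 3), c(3, 4))
print(u)
print(i)

## A named list and an environment holding the same items
small_list = list(a = 1, b = 2)
small_env = list2env(small_list, envir = new.env(hash = TRUE))

## Both support lookup by name with [[ ]]
from_list = small_list[["b"]]
from_env = small_env[["b"]]

## exists() checks for a name; inherits=FALSE restricts the search
## to this environment only (parent environments are not searched)
has_c = exists("c", envir = small_env, inherits = FALSE)
print(has_c)
```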


In [7]:
## Create some example data
VECTOR1 = sample(c(1:1000000), 1000000)
VECTOR2 = c(1:1000000)
LIST1 = as.list(VECTOR2)
names(LIST1) = as.character(VECTOR1)
DF1 = data.frame(A=VECTOR1, B=VECTOR2)
rownames(DF1) = as.character(VECTOR1)

In [8]:
mem_used()

218 MB

In [9]:
## How long does it take to find an item?
## Using a vector
t = system.time({idx = match(567890, VECTOR1)})
print("Vector:")
print(idx)
print(t)

## Using a vector version 2
t = system.time({idx = which(VECTOR1 == 567890)})
print("Vector #2:")
print(idx)
print(t)

## Using a list
t = system.time({idx = LIST1[[as.character(567890)]]})
print("List:")
print(idx)
print(t)

## Using a dataframe
t = system.time({idx = match(567890, DF1$A)})
print("DataFrame:")
print(idx)
print(t)

## Using a dataframe version 2
t = system.time({idx = DF1$B[DF1$A == 567890]})
print("DataFrame #2:")
print(idx)
print(t)

## Using a dataframe version 3 (rownames)
t = system.time({idx = DF1[as.character(567890), 'B']})
print("DataFrame #3:")
print(idx)
print(t)

[1] "Vector:"
[1] 453591
   user  system elapsed 
  0.002   0.000   0.002 
[1] "Vector #2:"
[1] 453591
   user  system elapsed 
  0.003   0.000   0.003 
[1] "List:"
[1] 453591
   user  system elapsed 
  0.218   0.000   0.219 
[1] "DataFrame:"
[1] 453591
   user  system elapsed 
  0.005   0.003   0.008 
[1] "DataFrame #2:"
[1] 453591
   user  system elapsed 
  0.006   0.000   0.007 
[1] "DataFrame #3:"
[1] 453591
   user  system elapsed 
  0.039   0.000   0.038 


In [10]:
## How long does it take to determine if an item exists?
x = 567890
## Using a vector
t = system.time({test = x %in% VECTOR1})
print("Vector:")
print(test)
print(t)

## Using a list
t = system.time({test = as.character(x) %in% names(LIST1)})
print("List:")
print(test)
print(t)

## Using a dataframe
t = system.time({test = x %in% DF1$A})
print("DataFrame:")
print(test)
print(t)

## Using a dataframe version 2
t = system.time({test = x %in% rownames(DF1)})
print("DataFrame #2:")
print(test)
print(t)

[1] "Vector:"
[1] TRUE
   user  system elapsed 
  0.005   0.003   0.007 
[1] "List:"
[1] TRUE
   user  system elapsed 
  0.204   0.000   0.204 
[1] "DataFrame:"
[1] TRUE
   user  system elapsed 
  0.002   0.000   0.002 
[1] "DataFrame #2:"
[1] TRUE
   user  system elapsed 
  0.008   0.000   0.008 


In [11]:
## Now let's compare a list to an environment
ENV1 = as.environment(LIST1)

In [12]:
mem_used()

404 MB

In [13]:
## How long does it take to find an item?
## Using a list
t = system.time({idx = LIST1[[as.character(567890)]]})
print("List:")
print(idx)
print(t)

## Using an environment
t = system.time({idx = ENV1[[as.character(567890)]]})
print("Environment:")
print(idx)
print(t)

[1] "List:"
[1] 453591
   user  system elapsed 
  0.009   0.000   0.008 
[1] "Environment:"
[1] 453591
   user  system elapsed 
      0       0       0 


In [14]:
## How long does it take to determine if an item exists?
x = 567890
## Using a list
t = system.time({test = as.character(x) %in% names(LIST1)})
print("List:")
print(test)
print(t)

## Using an environment
t = system.time({test = exists(as.character(x), where=ENV1)})
print("Environment:")
print(test)
print(t)

[1] "List:"
[1] TRUE
   user  system elapsed 
  0.005   0.002   0.007 
[1] "Environment:"
[1] TRUE
   user  system elapsed 
  0.001   0.000   0.001 


## dbSNP Dataset

For the following examples, we'll be using a data file from dbSNP that contains information about all single nucleotide polymorphisms (SNPs) on human chromosome 1. The data file is a tab-delimited text file containing four columns: the 'rs' number of the SNP, the chromosome, the position, and a comma-separated list of genes at the same location. Note: the file begins with a multi-line header (the first seven lines), which must be skipped when loading the data.

In [15]:
print(system("head ./xdata/chr1_reducedCols.txt", intern=TRUE))

 [1] "dbSNP Chromosome Report"                                                                 
 [2] "Refer to ftp://ftp.ncbi.nlm.nih.gov/snp/00readme for documentation on tabular data below"
 [3] ""                                                                                        
 [4] "rs#\tchr\tchr\tlocal"                                                                    
 [5] "\t\tpos\tloci"                                                                           
 [6] ""                                                                                        
 [7] ""                                                                                        
 [8] "171\t1\t175261679\t"                                                                     
 [9] "242\t1\t20869461\t"                                                                      
[10] "538\t1\t6160958\tKCNAB2"                                                                 
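
The header layout can be checked on a small in-memory sample before reading the full file. The sketch below copies the ten lines shown above into a character vector (`sample_lines`, a name made up here) and parses them with `read.delim`, skipping the seven header lines and treating empty fields as missing.

```r
## The ten lines printed by head(), reproduced as an in-memory sample
sample_lines = c(
    "dbSNP Chromosome Report",
    "Refer to ftp://ftp.ncbi.nlm.nih.gov/snp/00readme for documentation on tabular data below",
    "",
    "rs#\tchr\tchr\tlocal",
    "\t\tpos\tloci",
    "",
    "",
    "171\t1\t175261679\t",
    "242\t1\t20869461\t",
    "538\t1\t6160958\tKCNAB2"
)

## Skip the 7 header lines and supply our own column names;
## empty strings become NA
sample_snps = read.delim(text = paste(sample_lines, collapse = "\n"),
                         header = FALSE, skip = 7, sep = "\t",
                         col.names = c("rs", "chr", "pos", "loci"),
                         as.is = TRUE, na.strings = c("NA", "", " "))
print(sample_snps)
print(sapply(sample_snps, class))
```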


## Connecting to Relational DBs in R

[https://cran.r-project.org/web/packages/RMySQL/index.html](https://cran.r-project.org/web/packages/RMySQL/index.html)

We'll be connecting to the same DB as last time. The R package `RMySQL` will connect to the database using connection settings stored in a configuration file in your home directory (`~/.my.cnf`). The file should contain 'groups' of settings for databases that you connect to frequently. For example, the following should be entered in the configuration file to allow you to connect to a database called 'bmi535_snps' (the square brackets indicate a 'group', and the same file can contain several groups).

    [bmi535_snps]
    host=localhost
    user=mooneymi
    password=mypassword
    database=bmi535_snps

**Note:** With the latest version of MySQL you may get an "Authentication plugin 'caching_sha2_password' cannot be loaded" error. You can avoid this error by changing the user's authentication plugin to `mysql_native_password`.

In [16]:
## Connect to the MySQL database using connection settings defined in ~/.my.cnf
conn = dbConnect(RMySQL::MySQL(), group="bmi535_snps")

In [17]:
## Let's query the DB
system.time({query = "SELECT * FROM snps WHERE chr = 1 AND pos = 225512846 AND loci = 'DNAH14';"
res = dbSendQuery(conn, query)
rows = dbFetch(res)})

   user  system elapsed 
  0.001   0.000   5.747 

In [18]:
rows
dbClearResult(res)

rs,chr,pos,loci
189425743,1,225512846,DNAH14


In [19]:
## Now let's query the DB using the indexed table
system.time({query = "SELECT * FROM snps_idx WHERE chr = 1 AND pos = 225512846 AND loci = 'DNAH14';"
res = dbSendQuery(conn, query)
rows = dbFetch(res)})

   user  system elapsed 
  0.001   0.000   0.005 

In [20]:
rows
dbClearResult(res)

rs,chr,pos,loci
189425743,1,225512846,DNAH14


In [21]:
## Disconnect from the DB
dbDisconnect(conn)
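
The difference between the two queries above comes from the index on the second table. The course database is not reproduced here, so the following is only a sketch of the idea: it uses an in-memory SQLite database through the `RSQLite` package (an assumption, since it is not in the requirements list) as a stand-in for MySQL. The table `snps_toy`, the index `idx_pos`, and the connection name `toy_conn` are all hypothetical.

```r
library(DBI)

## Assumption: RSQLite is installed; SQLite stands in for the MySQL server
toy_conn = dbConnect(RSQLite::SQLite(), ":memory:")

## A tiny table with the same columns as the snps table
toy = data.frame(rs = c(171L, 242L, 538L),
                 chr = c(1L, 1L, 1L),
                 pos = c(175261679L, 20869461L, 6160958L),
                 loci = c(NA, NA, "KCNAB2"),
                 stringsAsFactors = FALSE)
dbWriteTable(toy_conn, "snps_toy", toy)

## Index the column used in the WHERE clause so lookups
## don't need to scan every row
dbExecute(toy_conn, "CREATE INDEX idx_pos ON snps_toy (pos);")

## dbGetQuery() sends the query, fetches, and clears the result in one step;
## the value is passed as a bound parameter rather than pasted into the SQL
hit = dbGetQuery(toy_conn, "SELECT * FROM snps_toy WHERE pos = ?;",
                 params = list(6160958L))
print(hit)

dbDisconnect(toy_conn)
```

On a three-row table the index makes no measurable difference; the point is only to show where an index is declared and how a parameterized query looks.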

## `data.table`

The `data.table` package implements what is essentially an optimized dataframe. 

[https://cran.r-project.org/web/packages/data.table/index.html](https://cran.r-project.org/web/packages/data.table/index.html)

[https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html)

In [22]:
## Let's start by loading the data into a standard R dataframe
## Note we can load directly from a compressed file (gzip)
## This takes a few minutes
mem_used()
system.time({snps = read.delim('./xdata/chr1_reducedCols.txt.gz', header=FALSE, skip=7, sep='\t', 
                               col.names=c('rs', 'chr', 'pos', 'loci'), as.is=TRUE, na.strings=c('NA', '', ' '))})
mem_used()

404 MB

   user  system elapsed 
 81.462   1.217  82.878 

779 MB

In [23]:
## Get the size of the dataframe
dim(snps)

In [24]:
## View the first few rows
head(snps)

rs,chr,pos,loci
171,1,175261679,
242,1,20869461,
538,1,6160958,KCNAB2
546,1,93617546,TMED5
549,1,15546825,TMEM51
568,1,203713133,ATP2B4


In [25]:
## Let's look at the data types for each column
sapply(snps, class)

**Note:** (IMO) R does a better job with missing values and data types than Pandas. For example, missing values are allowed in both character and numeric columns, and don't require special treatment. Of course, we saw that the data cleaning in Pandas was fairly easy if you know what to look for.

In [26]:
## Search the dataframe for a specific row
## Note: here we wrap the condition inside which() to exclude rows with NAs
system.time({row = snps[which(with(snps, chr==1 & pos==225512846 & loci=='DNAH14')), ]})
row

   user  system elapsed 
  0.265   0.065   0.330 

Unnamed: 0,rs,chr,pos,loci
3456789,189425743,1,225512846,DNAH14


In [27]:
## Don't make the query more complicated than it needs to be
system.time({row = snps[which(with(snps, pos==225512846)), ]})
row

   user  system elapsed 
  0.045   0.011   0.058 

Unnamed: 0,rs,chr,pos,loci
3456789,189425743,1,225512846,DNAH14
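
The reason for wrapping the condition in `which()` is easier to see on a toy dataframe. In the sketch below (made-up data, base R only), comparing a column that contains `NA` yields `NA` for those rows, and indexing with `NA` returns placeholder rows full of `NA` values; `which()` keeps only the positions that are `TRUE`.

```r
## Toy dataframe with missing gene names
toy = data.frame(pos = c(10L, 20L, 30L),
                 loci = c(NA, "TMED5", NA),
                 stringsAsFactors = FALSE)

## Comparison with NA gives NA, not FALSE
cond = toy$loci == "TMED5"
print(cond)

## Indexing with the raw logical vector keeps an all-NA row for each NA
raw_rows = toy[cond, ]
print(raw_rows)

## which() returns only the indices where the condition is TRUE
clean_rows = toy[which(cond), ]
print(clean_rows)
```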


### Load Data into a `data.table`

In [28]:
## Load SNP data into data.table
mem_change({snps_dt = as.data.table(snps)})

245 MB

In [29]:
## General syntax for "querying" a data.table is DT[i, j, by]
## For example, here we will ask: How many SNPs are mapped to each gene (for just the first 50 SNPs)?
snps_dt[1:50, list(.N), by="loci"]

loci,N
,9
KCNAB2,1
TMED5,1
TMEM51,1
ATP2B4,1
FUCA1,1
"C1orf123,CPT2",1
SERPINC1,1
AGT,1
PTP4A2,1


In [30]:
## Query the data.table
system.time({row = snps_dt[chr==1 & pos==225512846 & loci=='DNAH14',]})
row

   user  system elapsed 
  2.862   0.210   0.780 

rs,chr,pos,loci
189425743,1,225512846,DNAH14


In [31]:
## Add a key to the data.table
setkey(snps_dt, pos)

In [32]:
## Query the data.table with key
system.time({row = snps_dt[chr==1 & pos==225512846 & loci=='DNAH14']})
row

   user  system elapsed 
  1.988   0.141   0.540 

rs,chr,pos,loci
189425743,1,225512846,DNAH14


In [33]:
## Matching on only the key improves performance significantly over 
## a regular dataframe
system.time({row = snps_dt[pos==225512846]})
row

   user  system elapsed 
  0.005   0.001   0.001 

rs,chr,pos,loci
189425743,1,225512846,DNAH14
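
Once a key is set, `data.table` also supports looking up rows by key value directly with the `.()` syntax. A minimal sketch on made-up rows (`toy_dt` is a hypothetical name) shows that `setkey` sorts the table by the key column and that `.()` then looks the value up in the key.

```r
library(data.table)

## Toy table with a few SNP positions
toy_dt = data.table(rs = c(171L, 242L, 538L, 546L),
                    pos = c(175261679L, 20869461L, 6160958L, 93617546L),
                    loci = c(NA, NA, "KCNAB2", "TMED5"))

## setkey() sorts the table by pos (in place) and marks pos as the key
setkey(toy_dt, pos)
print(key(toy_dt))
print(toy_dt)

## Look up a row by key value; .() wraps the value(s) to match
hit = toy_dt[.(6160958L)]
print(hit)

## The same lookup, returning only one column
hit_gene = toy_dt[.(6160958L), loci]
print(hit_gene)
```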


## BigMemory

The `bigmemory` package allows for storing large datasets in shared-memory and file-backed data structures. This allows for large data structures to be shared across multiple R processes to facilitate efficient parallel processing. 

[https://cran.r-project.org/web/packages/bigmemory/index.html](https://cran.r-project.org/web/packages/bigmemory/index.html)
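
Sharing works through a small descriptor object: one handle describes the matrix, and another handle attaches to the same underlying data. The sketch below does both steps in a single R session for simplicity, using a tiny in-memory matrix; the names `shared_bm`, `shared_desc`, and `second_handle` are made up for illustration. In practice the descriptor would be passed to another process.

```r
library(bigmemory)

## A small shared-memory big.matrix filled with zeros
shared_bm = big.matrix(nrow = 3, ncol = 2, type = "integer", init = 0)

## describe() returns the information needed to locate the data
shared_desc = describe(shared_bm)

## attach.big.matrix() creates a second handle to the same data (no copy)
second_handle = attach.big.matrix(shared_desc)

## A write through one handle is visible through the other
second_handle[1, 1] = 42L
print(shared_bm[1, 1])
```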

One caveat is that `bigmemory` creates matrices, which will handle only a single data type, unlike dataframes. By default, character columns in dataframes will be converted to factors and factors converted to numeric levels. The `ff` package may be a better solution if you must have multiple data types in the same object.

Similar to `bigmemory`, the `ff` package allows for on-disk storage of large datasets with efficient data access and the ability to share the same data structure across multiple R processes. I won't cover this package here, but see link below for more info. 

[https://cran.r-project.org/web/packages/ff/index.html](https://cran.r-project.org/web/packages/ff/index.html)
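
The single-type caveat can be illustrated with base R alone. The sketch below is not `bigmemory`'s own code; it uses `as.matrix()` and `data.matrix()` on a made-up dataframe to show the same kind of conversion described above, where character values end up as integer factor codes.

```r
## Toy dataframe with one integer and one character column
toy = data.frame(rs = c(538L, 546L, 549L),
                 loci = c("KCNAB2", "TMED5", "KCNAB2"),
                 stringsAsFactors = FALSE)

## as.matrix() on mixed columns falls back to a character matrix
char_type = typeof(as.matrix(toy))
print(char_type)

## data.matrix() converts character -> factor -> integer codes,
## so the gene names are replaced by level numbers
num_mat = data.matrix(toy)
print(num_mat)
loci_codes = unname(num_mat[, "loci"])
print(loci_codes)
```

The integer codes are only meaningful together with the factor levels, which is why the examples in this section keep just the three integer columns.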

In [34]:
## Let's create a file-backed bigmatrix object using only 
## the first 3 columns of the snps dataframe
## First check that the file doesn't exist
if (file.exists("./xdata/snps_bigmem.bin")) {
    file.remove("./xdata/snps_bigmem.bin")
    file.remove("./xdata/snps_bigmem.bin.desc")
}

In [35]:
## Create a bigmatrix object with the first three columns of the SNPs dataframe (all integers)
## Note: we are first converting the dataframe to a numeric matrix
mem_change({snps_bm = as.big.matrix(as.matrix(snps[,1:3]), type="integer", backingfile="snps_bigmem.bin", backingpath="./xdata")})

“No descriptor file given, it will be named snps_bigmem.bin.desc”

143 kB

In [36]:
## How much space is used on disk
print(system("ls -lt ./xdata/snps_bigmem*", intern=TRUE))

[1] "-rw-r--r--  1 mooneymi  1971611142        351 Jan  9 08:03 ./xdata/snps_bigmem.bin.desc"
[2] "-rw-r--r--  1 mooneymi  1971611142  146855317 Jan  9 08:03 ./xdata/snps_bigmem.bin"     


In [37]:
head(snps_bm)

rs,chr,pos
171,1,175261679
242,1,20869461
538,1,6160958
546,1,93617546
549,1,15546825
568,1,203713133


In [38]:
## Use the mwhich() function to query the bigmatrix object
system.time({row = snps_bm[mwhich(snps_bm, c('chr','pos'), c(1, 225512846), c('eq', 'eq')), ]})
row

   user  system elapsed 
  0.345   0.001   0.345 

The performance for searching is pretty poor (comparable to the plain dataframe above, and far slower than the keyed `data.table`), but keep in mind that `bigmemory` was designed with numeric matrices in mind, not tables of heterogeneous data. 

In [39]:
## Data access is pretty fast for specific data elements
## i.e. selecting a specific index
system.time({z = snps_bm[1200:1220,]})

   user  system elapsed 
      0       0       0 

In [40]:
z

rs,chr,pos
12361,1,224564377.0
12371,1,180163390.0
12375,1,10596341.0
12384,1,32256166.0
12386,1,36068863.0
12395,1,38268836.0
12419,1,44686322.0
12439,1,25169634.0
12442,1,43829177.0
12455,1,55533917.0


### `bigalgebra` 

The `bigalgebra` package allows efficient linear algebra operations on `bigmemory` matrices.

[https://cran.r-project.org/web/packages/bigalgebra/index.html](https://cran.r-project.org/web/packages/bigalgebra/index.html)

In [41]:
if (file.exists("./xdata/bigmem.bin")) {
    file.remove("./xdata/bigmem.bin")
    file.remove("./xdata/bigmem.bin.desc")
}
## Let's create another on-disk bigmatrix
bm = as.big.matrix(matrix(runif(1000000), 10000, 100), type="double", backingfile="bigmem.bin", backingpath="./xdata")

“No descriptor file given, it will be named bigmem.bin.desc”

In [42]:
## Create another bigmatrix object with just the first row
v = as.big.matrix(bm[1,])

“Coercing vector to a single-column matrix.”

In [43]:
## Dimensions of the matrix
dim(bm)

In [44]:
## Dimensions of the vector
dim(v)

In [45]:
## Calculate the matrix-vector product
system.time({x = bm %*% v})

   user  system elapsed 
  0.006   0.002   0.005 

In [46]:
dim(x)

## HDF5

The `hdf5r` package provides an interface between R and the HDF5 libraries, much like PyTables in Python. However, `hdf5r` has fairly limited functionality, so it is not as useful for querying data sets. But it can be useful for storing large datasets and accessing/processing chunks of that data.

[https://cran.r-project.org/package=hdf5r](https://cran.r-project.org/package=hdf5r)

**Important Note:** R stores matrices in column-major order, whereas Python (NumPy) defaults to row-major order. This results in matrices being transposed when created in Python then loaded in R. More info on why this occurs is here: [https://cran.r-project.org/web/packages/reticulate/vignettes/arrays.html](https://cran.r-project.org/web/packages/reticulate/vignettes/arrays.html)
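
The storage-order difference can be reproduced in base R without any HDF5 file. In this sketch (variable names are made up), the same six values are laid out column by column (R's default) and row by row, and a row-major buffer read back with R's default order comes out as the transpose.

```r
vals = 1:6

## R's default: values fill down the columns (column-major)
col_major = matrix(vals, nrow = 2)
print(col_major)

## byrow=TRUE fills across the rows instead (row-major layout)
row_major = matrix(vals, nrow = 2, byrow = TRUE)
print(row_major)

## The flat buffer a row-major writer would store for the 2x3 matrix
buffer = as.vector(t(row_major))
print(buffer)

## Reading that buffer in column-major order with swapped dimensions
## yields the transpose of the original matrix
reread = matrix(buffer, nrow = 3)
print(reread)
```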

In [47]:
## Let's use the hdf5r package to look at our previously saved HDF5 file
mem_used()
h5_file_pandas = H5File$new('./xdata/snps_pandas_hdf_zlib.h5', mode='r')
h5_file_pandas
mem_used()

1.08 GB

Class: H5File
Filename: /Users/mooneymi/Documents/github/large_scale_data/xdata/snps_pandas_hdf_zlib.h5
Access type: H5F_ACC_RDONLY
Attributes: TITLE, CLASS, VERSION, PYTABLES_FORMAT_VERSION
Listing:
 name  obj_type dataset.dims dataset.type_class
 snps H5I_GROUP         <NA>               <NA>

1.08 GB

In [48]:
## The ls() method gives detailed info about file and group objects
h5_file_pandas$ls()

name,link.type,obj_type,num_attrs,group.nlinks,group.mounted,dataset.rank,dataset.dims,dataset.maxdims,dataset.type_class,dataset.space_class,committed_type
snps,0,2,16,1,0,,,,,,


In [49]:
## Access groups and objects by name (similar to lists)
snps_grp = h5_file_pandas[['snps']]
snps_grp$ls()

name,link.type,obj_type,num_attrs,group.nlinks,group.mounted,dataset.rank,dataset.dims,dataset.maxdims,dataset.type_class,dataset.space_class,committed_type
table,0,5,27,,,1,12237943,inf,6,1,


In [50]:
snps_table = snps_grp[['table']]
snps_table[1:5]

index,rs,chr,pos,loci
0,171,1,175261679,
1,242,1,20869461,
2,538,1,6160958,KCNAB2
3,546,1,93617546,TMED5
4,549,1,15546825,TMEM51


In [51]:
mem_used()

1.08 GB

In [52]:
## View data types
snps_table$get_type()

Class: H5T_COMPOUND
Datatype: H5T_COMPOUND {
      H5T_STD_I64LE "index" : 0;
      H5T_STD_I64LE "rs" : 8;
      H5T_STD_I64LE "chr" : 16;
      H5T_IEEE_F64LE "pos" : 24;
      H5T_STRING {
         STRSIZE 76;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      } "loci" : 32;
   }

In [53]:
## Close the file
h5_file_pandas$close_all()

### Creating an HDF5 file with `hdf5r`

For this example we'll create two groups in the HDF5 file, one to hold the SNPs table (compound data type), and another to hold a random numeric matrix.

In [54]:
## Create a new HDF5 file
if (file.exists("./xdata/snps_hdf5r.h5")) {
    file.remove("./xdata/snps_hdf5r.h5")
}
new_h5_file = H5File$new("./xdata/snps_hdf5r.h5", mode='w')

In [55]:
## Create two groups
snps_grp = new_h5_file$create_group('snps')
mat_grp = new_h5_file$create_group('matrices')

In [56]:
snps_grp

Class: H5Group
Filename: /Users/mooneymi/Documents/github/large_scale_data/xdata/snps_hdf5r.h5
Group: /snps

In [57]:
## Add the snps dataframe as a dataset
snps_grp[['table']] = snps

In [58]:
## Note: compound datasets can't be indexed in 2 dimensions
snps_grp[['table']][1:5]

rs,chr,pos,loci
171,1,175261679,
242,1,20869461,
538,1,6160958,KCNAB2
546,1,93617546,TMED5
549,1,15546825,TMEM51


In [59]:
snps_grp[['table']]$get_type()

Class: H5T_COMPOUND
Datatype: H5T_COMPOUND {
      H5T_STD_I32LE "rs" : 0;
      H5T_STD_I32LE "chr" : 4;
      H5T_STD_I32LE "pos" : 8;
      H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      } "loci" : 16;
   }
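
Here the integer columns are stored as 32-bit values (`H5T_STD_I32LE`), which matches R's native integer type, whereas the Pandas file used 64-bit integers. The sketch below checks that the positions used in this lecture fit in 32 bits and shows what happens when a value does not; the variable names are made up, and `bit64` is used only to show a 64-bit alternative.

```r
library(bit64)

## Largest value an R integer can hold (32-bit signed)
int_max = .Machine$integer.max
print(int_max)

## The position queried earlier fits comfortably
example_pos = 225512846L
print(example_pos <= int_max)

## A value beyond the 32-bit range becomes NA when coerced to integer
too_big = suppressWarnings(as.integer(3e9))
print(too_big)

## bit64 provides a 64-bit integer class for such values
big = as.integer64("3000000000")
print(big)
print(class(big))
```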

In [60]:
## Now let's add a numeric matrix to the file
m = matrix(runif(1000000), 10000, 100)
mat_grp[['A']] = m

In [61]:
mat_grp

Class: H5Group
Filename: /Users/mooneymi/Documents/github/large_scale_data/xdata/snps_hdf5r.h5
Group: /matrices
Listing:
 name    obj_type dataset.dims dataset.type_class
    A H5I_DATASET  10000 x 100          H5T_FLOAT

In [62]:
mat_grp[['A']]$get_type()

Class: H5T_FLOAT
Datatype: H5T_IEEE_F64LE

In [63]:
## Because this is a numeric array, we can index (subset) in both dimensions
mat_grp[['A']][1:5,1:5]

0,1,2,3,4
0.3780405,0.2029585,0.1962739,0.9548512,0.9926566
0.5090544,0.4183603,0.1213849,0.8604248,0.4153542
0.7913487,0.6674395,0.3249268,0.4707616,0.7851695
0.1326862,0.4396432,0.3010942,0.4232426,0.22379
0.3257384,0.7836584,0.2457535,0.2047286,0.9905719
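
Indexing in both dimensions is what makes chunked processing possible: only the requested block is read into memory. The following is a minimal sketch, assuming a temporary file and a small random matrix (`h5_path`, `chunk_file`, and `chunk_size` are made-up names), that computes the overall mean of a dataset one block of rows at a time.

```r
library(hdf5r)

## Write a small random matrix to a temporary HDF5 file
h5_path = tempfile(fileext = ".h5")
chunk_file = H5File$new(h5_path, mode = "w")
m = matrix(runif(1000 * 10), 1000, 10)
chunk_file[["A"]] = m
ds = chunk_file[["A"]]

## Read 250 rows at a time and keep a running sum and count
chunk_size = 250
running_sum = 0
running_n = 0
for (start in seq(1, nrow(m), by = chunk_size)) {
    end = min(start + chunk_size - 1, nrow(m))
    block = ds[start:end, 1:ncol(m)]
    running_sum = running_sum + sum(block)
    running_n = running_n + length(block)
}
chunked_mean = running_sum / running_n

## Clean up, then compare with the in-memory result
chunk_file$close_all()
file.remove(h5_path)
print(all.equal(chunked_mean, mean(m)))
```

The matrix here is small enough to hold in memory, which is what allows the final comparison; with a truly large dataset only the running totals would be kept.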


In [64]:
## View the file structure
new_h5_file

Class: H5File
Filename: /Users/mooneymi/Documents/github/large_scale_data/xdata/snps_hdf5r.h5
Access type: H5F_ACC_RDWR
Listing:
     name  obj_type dataset.dims dataset.type_class
 matrices H5I_GROUP         <NA>               <NA>
     snps H5I_GROUP         <NA>               <NA>

In [65]:
## Close the file
new_h5_file$close_all()

## In-Class Exercises

In [66]:
## Exercise 1.
## Use parallel processes to calculate the column sums 
## of the first ten columns of a file-backed bigmatrix or HDF5 file
## (you can use 'bm', or the 'snps_hdf5r.h5' defined above).
## Use describe() and attach.big.matrix() from the bigmemory library.
## OR use H5File$new() from the hdf5r library.
## An easy option for parallel R processes is mclapply() or parLapply() from the
## parallel R package (mclapply does not work on Windows)



In [67]:
## Here's an example of sequentially calling foo() 10 times
system.time({lapply(1:10, function(x){foo(x, 1000000)})})

   user  system elapsed 
  0.039   0.015  30.115 

In [68]:
## Here's an example of using mclapply() to call foo() in parallel, using 4 cores
system.time({mclapply(1:10, function(x){foo(x, 1000000)}, mc.cores=4)})

   user  system elapsed 
  0.026   0.038   9.051 

In [69]:
## On Windows machines use parLapply instead of mclapply
## First create a local cluster by specifying the number of cores
cl = makeCluster(4)
## Export variables or function definitions to each node of the cluster
clusterExport(cl, c("foo"))
## If your code depends on any packages you'll also have to load them on each node of the cluster, e.g.:
#clusterEvalQ(cl, library(bigmemory))

## Run your code in parallel
system.time({parLapply(cl, 1:10, function(x){foo(x, 1000000)})})
stopCluster(cl)

   user  system elapsed 
  0.005   0.001   9.053 
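
The elapsed times above are what simple arithmetic predicts. Each call to `foo()` sleeps for 3 seconds, and 10 calls spread over 4 workers means the busiest worker handles 3 of them. A short sketch of that calculation (variable names are made up):

```r
n_tasks = 10      # calls to foo()
task_secs = 3     # Sys.sleep(2) + Sys.sleep(1) inside foo()
n_workers = 4     # cores / cluster nodes

## Sequential: every call runs one after another
sequential_secs = n_tasks * task_secs
print(sequential_secs)

## Parallel: the busiest worker runs ceiling(10 / 4) = 3 calls
parallel_secs = ceiling(n_tasks / n_workers) * task_secs
print(parallel_secs)

## Speed-up is below 4x because 10 tasks don't divide evenly over 4 workers
speedup = sequential_secs / parallel_secs
print(speedup)
```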

#### Last Updated: 8-Jan-2021