# BMI 535/635: Management & Processing of Large-scale Data

#### Author: Michael Mooney (mooneymi@ohsu.edu)

## Week 3: Data Storage and Querying Solutions in R

1. Introduction
2. Learning Objectives
3. Resource Profiling
4. Review of R Data Types
5. Data from dbSNP
6. Connecting to Relational DBs
7. data.table
8. BigMemory
9. ff
10. HDF5

Requirements:

- R packages:
    - pryr
    - RMySQL
    - data.table
    - bigmemory
    - ff
    - ffbase
    - rhdf5
- Data files:
    - dbSNP annotations (chromosome 1 only): `./data/chr1_reducedCols.txt.gz` (the examples below read the uncompressed file, `./data/chr1_reducedCols.txt`)
    - A MySQL config file containing connection parameters: `~/.my.cnf`

In [1]:
library(pryr)
library(RMySQL)
library(data.table)
library(bigmemory)
library(ff)
library(ffbase)
library(rhdf5)

Loading required package: DBI

Attaching package: ‘data.table’

The following object is masked from ‘package:pryr’:

    address

Loading required package: bigmemory.sri
Loading required package: bit
Attaching package bit
package:bit (c) 2008-2012 Jens Oehlschlaegel (GPL-2)
creators: bit bitwhich
coercion: as.logical as.integer as.bit as.bitwhich which
operator: ! & | xor != ==
querying: print length any all min max range sum summary
bit access: length<- [ [<- [[ [[<-
for more help type ?bit

Attaching package: ‘bit’

The following object is masked from ‘package:data.table’:

    setattr

The following object is masked from ‘package:base’:

    xor

Attaching package ff
- getOption("fftempdir")=="/var/folders/3r/wws_4jz54ms2t6m0jrz8k_6mg3ll58/T//Rtmp1Ayz8r"

- getOption("ffextension")=="ff"

- getOption("ffdrop")==TRUE

- getOption("fffinonexit")==TRUE

- getOption("ffpagesize")==65536

- getOption("ffcaching")=="mmnoflush"  -- consider "ffeachflush" if your system stalls on large writ

## Introduction

This is the **R** version of the previous lecture. We'll be addressing the same big-data issues, but this time exploring solutions offered in R. Just as a reminder, here are the problems often faced when working with large data sets:

1. Data does not fit into memory
    - In particular, this can be a problem when setting up parallel computations, where each process needs the full data
    - R can sometimes present unique challenges when it comes to memory usage. For more info, see:
        - [http://adv-r.had.co.nz/memory.html](http://adv-r.had.co.nz/memory.html)
        - `?Memory`
2. Accessing (querying) the data is slow
3. Data files on-disk are very large (i.e. not easily portable)

Potential Solutions:

1. Use on-disk storage that is optimized for fast read/write access
2. Use data storage that allows for multiple concurrent reads (i.e. can be shared across multiple processes)
3. Use data compression

## Learning Objectives

1. You will learn some basic methods for profiling the amount of resources and time used by computational tasks
2. You will learn how to store large datasets in various "high-performance" R data structures
3. You will learn how to query data in each of the data structures
4. You will learn how to convert between these various data storage solutions


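## Resource Profiling

`system.time` can be used to measure how long an R expression takes to evaluate; it reports user, system, and elapsed (wall-clock) time. The `pryr` functions `mem_used` and `mem_change` report the total memory currently used by R objects and the change in memory caused by evaluating an expression, respectively.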

More information can be found at Hadley Wickham's 'Advanced R' site: 

[http://adv-r.had.co.nz/Profiling.html](http://adv-r.had.co.nz/Profiling.html)

[http://adv-r.had.co.nz/memory.html#memory-profiling](http://adv-r.had.co.nz/memory.html#memory-profiling)

In [None]:
foo = function(a, n=100) {
    Sys.sleep(2)
    b = rep(a, n)
    Sys.sleep(1)
    return(NULL)
}

system.time({foo(1,10000000)})

In [None]:
mem_change({foo(1,10000000)})

In [None]:
mem_used()
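
As a rough sketch of how timing and object size can be inspected with base R alone, the example below pulls the elapsed time out of the `system.time` result and uses `object.size` to measure a single object. The helper `build_vector` is a hypothetical name used only for illustration; it is not part of the lecture code.

```r
## Hypothetical helper used only for illustration: allocate a numeric vector
build_vector <- function(n) {
    rep(1, n)
}

## system.time() returns user, system, and elapsed (wall-clock) seconds
timing <- system.time({v <- build_vector(1e6)})
print(timing[["elapsed"]])

## object.size() reports the bytes used by one object;
## a numeric (double) vector needs roughly 8 bytes per element
size_bytes <- as.numeric(object.size(v))
print(format(object.size(v), units = "MB"))
```

`object.size` measures one object, whereas `mem_used` reports the total for the session, so the two answer slightly different questions.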

## Review of Basic R Data Types

Basic R data types and when to use them:

**Vectors**: Vectors store collections of data elements of a single type. R will perform automatic type conversions, so be careful and pay attention to your data types. Vectors can be named, so you can access elements by name or by index. Note: set operations can be performed on vectors.

**Lists**: An R list is an ordered collection whose elements can be of different types and can be accessed by index or by name, so a named list is loosely similar to a Python dictionary. Looking up an element by name requires scanning the names, however, so if you have a large collection of data and need to repeatedly search for specific items, use an environment instead.

**Environments**: An environment can be created and accessed very much like a list, but because of the way it is stored internally (typically as a hash table), access by name is much faster.

**DataFrames**: A table data structure (the inspiration for the Pandas DataFrame in Python), which can hold columns of different data types.
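
The properties described above can be sketched with a few lines of base R. All names and values here (`ages`, `record`, `"rs123"`, and so on) are made up for illustration.

```r
## Automatic type conversion: mixing numbers and strings yields a character vector
mixed <- c(1, "a", TRUE)
print(class(mixed))

## Named vectors can be accessed by name or by position
ages <- c(alice = 30, bob = 25)
print(ages[["bob"]])
print(ages[[1]])

## Set operations on vectors
print(union(c(1, 2, 3), c(3, 4)))
print(intersect(c(1, 2, 3), c(3, 4)))

## A list can hold elements of different types
record <- list(id = 1L, name = "rs123", flags = c(TRUE, FALSE))
print(sapply(record, class))

## A dataframe holds columns of different types
df <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)
print(sapply(df, class))
```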


In [None]:
## Create some example data
VECTOR1 = sample(c(1:1000000), 1000000)
VECTOR2 = c(1:1000000)
LIST1 = as.list(VECTOR2)
names(LIST1) = as.character(VECTOR1)
DF1 = data.frame(A=VECTOR1, B=VECTOR2)
rownames(DF1) = as.character(VECTOR1)

In [None]:
mem_used()

In [None]:
## How long does it take to find an item?
## Using a vector
t = system.time({idx = match(567890, VECTOR1)})
print(idx)
print(t)

## Using a vector version 2
t = system.time({idx = which(VECTOR1 == 567890)})
print(idx)
print(t)

## Using a list
t = system.time({idx = LIST1[[as.character(567890)]]})
print(idx)
print(t)

## Using a dataframe
t = system.time({idx = match(567890, DF1$A)})
print(idx)
print(t)

## Using a dataframe version 2
t = system.time({idx = DF1$B[DF1$A == 567890]})
print(idx)
print(t)

In [None]:
## How long does it take to determine if an item exists?
x = 567890
## Using a vector
t = system.time({test = x %in% VECTOR1})
print(test)
print(t)

## Using a list
t = system.time({test = as.character(x) %in% names(LIST1)})
print(test)
print(t)

## Using a dataframe
t = system.time({test = x %in% DF1$A})
print(test)
print(t)

In [None]:
## Now let's compare a list to an environment
ENV1 = as.environment(LIST1)

In [None]:
mem_used()

In [None]:
## How long does it take to find an item?
## Using a list
t = system.time({idx = LIST1[[as.character(567890)]]})
print(idx)
print(t)

## Using an environment
t = system.time({idx = ENV1[[as.character(567890)]]})
print(idx)
print(t)

In [None]:
## How long does it take to determine if an item exists?
x = 567890
## Using a list
t = system.time({test = as.character(x) %in% names(LIST1)})
print(test)
print(t)

## Using an environment
t = system.time({test = exists(as.character(x), where=ENV1)})
print(test)
print(t)
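
An environment can also be built directly rather than converted from a list. The following is a minimal sketch, assuming string keys and made-up values, of using an environment as a key/value lookup table; the variable name `lookup` is hypothetical.

```r
## Create an empty hashed environment and fill it with key/value pairs
lookup <- new.env(hash = TRUE)
for (i in 1:1000) {
    assign(as.character(i), i * 2, envir = lookup)
}

## Test for membership; inherits = FALSE restricts the search to this environment
has_500 <- exists("500", envir = lookup, inherits = FALSE)
has_5000 <- exists("5000", envir = lookup, inherits = FALSE)

## Retrieve a value by key
value_500 <- get("500", envir = lookup, inherits = FALSE)

## Convert back to a list when positional access is needed
## (the order of elements coming out of an environment is not guaranteed)
as_list <- as.list(lookup)
```

Setting `inherits = FALSE` matters for environments created with `new.env()`, because by default `exists` and `get` also search the enclosing environments.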

## dbSNP Dataset

For the following examples, we'll be using data from dbSNP, which contains information about all single nucleotide polymorphisms (SNPs) on human chromosome 1. The data file is a tab-delimited text file containing four columns: the 'rs' number of the SNP, the chromosome, the position, and a comma-separated list of genes at the same location. Note: the file contains a multi-line header.

In [None]:
print(system("head ./data/chr1_reducedCols.txt", intern=TRUE))
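
To illustrate the file layout described above without the real data, the sketch below writes a tiny synthetic file with a two-line header and reads it back. The rows, gene names, and header text are all made up; the real file uses a longer header, which is why the later cells use `skip=7`.

```r
## Write a small synthetic file: 2 header lines followed by tab-delimited rows
## (all values below are made up for illustration)
tmp <- tempfile(fileext = ".txt")
writeLines(c(
    "# header line 1",
    "# header line 2",
    "101\t1\t1000\tGENE1",
    "102\t1\t2000\t",
    "103\t1\t3000\tGENE2,GENE3"
), tmp)

## Skip the header lines and name the four columns explicitly
snps_demo <- read.delim(tmp, header = FALSE, skip = 2, sep = "\t",
                        col.names = c("rs", "chr", "pos", "loci"),
                        as.is = TRUE, na.strings = c("NA", "", " "))

## Split the comma-separated gene list into one character vector per SNP
genes <- strsplit(snps_demo$loci, ",")
print(snps_demo)
unlink(tmp)
```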

## Connecting to Relational DBs in R

We'll be connecting to the same DB as last time. The R package `RMySQL` will connect to the database using connection settings stored in a configuration file in your home directory (`~/.my.cnf`). The file should contain 'groups' of settings for databases that you connect to frequently. For example, the following should be entered in the configuration file to allow you to connect to a database called 'bmi535' (the square brackets indicate a 'group', and you can have multiple of these in the same file).

    [bmi535]
    host=localhost
    user=mooneymi
    password=mypassword
    database=bmi535

In [None]:
## Connect to the MySQL database using connection settings defined in ~/.my.cnf
conn = dbConnect(RMySQL::MySQL(), group="bmi535")

In [None]:
## Let's query the DB
system.time({query = "SELECT * FROM snps WHERE chr = 1 AND pos = 225512846 AND loci = 'DNAH14';"
res = dbSendQuery(conn, query)
rows = dbFetch(res)})

In [None]:
rows
dbClearResult(res)

In [None]:
## Now let's query the DB using the indexed table
system.time({query = "SELECT * FROM snps_idx WHERE chr = 1 AND pos = 225512846 AND loci = 'DNAH14';"
res = dbSendQuery(conn, query)
rows = dbFetch(res)})

In [None]:
rows
dbClearResult(res)

In [None]:
dbDisconnect(conn)
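
The same DBI workflow can be tried without a MySQL server. The sketch below swaps in an in-memory SQLite database as a stand-in, assuming the `RSQLite` package is installed (it is not in the requirements list above). The table contents and the index name `idx_pos` are made up, and the index is only presumed to resemble what distinguishes `snps_idx` from `snps`.

```r
library(DBI)

## Assumption: RSQLite is installed; an in-memory database stands in for MySQL
conn <- dbConnect(RSQLite::SQLite(), ":memory:")

## Synthetic rows for illustration only
demo <- data.frame(rs = 1:5, chr = 1L, pos = c(100L, 200L, 300L, 400L, 500L),
                   loci = c("A", "B", "C", "D", "E"), stringsAsFactors = FALSE)
dbWriteTable(conn, "snps", demo)

## An index on the queried column lets the database avoid a full table scan
dbExecute(conn, "CREATE INDEX idx_pos ON snps (pos)")

## dbGetQuery() sends the query, fetches all rows, and clears the result;
## the ? placeholders are filled from params, which avoids pasting values into SQL
rows <- dbGetQuery(conn, "SELECT * FROM snps WHERE chr = ? AND pos = ?",
                   params = list(1L, 300L))
print(rows)

dbDisconnect(conn)
```

`dbGetQuery` is a convenient shortcut when the whole result fits in memory; the `dbSendQuery`/`dbFetch`/`dbClearResult` sequence used above is more useful when results need to be fetched in chunks.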

## `data.table`

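`data.table` implements what is essentially an optimized dataframe. It supports keys (columns by which the table is sorted, which enables fast binary-search lookups) and can modify data by reference, avoiding the copies that standard dataframe operations often make.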

In [None]:
## Let's start by loading the data into a standard R dataframe
mem_used()
system.time({snps = read.delim('./data/chr1_reducedCols.txt', header=FALSE, skip=7, sep='\t', 
                               col.names=c('rs', 'chr', 'pos', 'loci'), as.is=TRUE, na.strings=c('NA', '', ' '))})
mem_used()

In [None]:
dim(snps)

In [None]:
## View the first few rows
head(snps)

In [None]:
sapply(snps, class)

In [None]:
## Search the dataframe for a specific row
## Note: here we wrap the condition inside which() to exclude rows with NAs
system.time({row = snps[which(with(snps, chr==1 & pos==225512846 & loci=='DNAH14')), ]})
row

In [None]:
system.time({row = snps[which(with(snps, pos==225512846)), ]})
row

### Load Data into a `data.table`

In [None]:
## Load SNP data into data.table
snps_dt = as.data.table(snps)

In [None]:
system.time({row = snps_dt[chr==1 & pos==225512846 & loci=='DNAH14',]})
row

In [None]:
## Add a key to the data.table
setkey(snps_dt, pos)

In [None]:
system.time({row = snps_dt[chr==1 & pos==225512846 & loci=='DNAH14']})
row

In [None]:
system.time({row = snps_dt[pos==225512846]})
row
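
Once a key is set, rows can also be selected by key value directly. The following is a minimal sketch on a small synthetic table (the values are made up) showing the `.()` lookup form alongside the ordinary logical condition used above.

```r
library(data.table)

## Synthetic table for illustration only
dt <- data.table(rs = 1:5, chr = 1L,
                 pos = c(500L, 100L, 400L, 200L, 300L),
                 loci = c("E", "A", "D", "B", "C"))

## setkey() sorts the table by pos in place and marks pos as the key
setkey(dt, pos)
print(key(dt))

## With a key set, .() looks up rows by key value using binary search
row_by_key <- dt[.(300L)]

## The same row via an ordinary logical condition
row_by_scan <- dt[pos == 300L]

## Extra conditions can be chained after the keyed lookup
row_filtered <- dt[.(300L)][loci == "C"]
```

Note that `setkey` reorders the rows of the table, so any code that depends on the original row order should be run before the key is set.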

## BigMemory

The `bigmemory` package allows for storing large datasets in shared-memory and file-backed data structures. This allows for large data structures to be shared across multiple R processes to facilitate efficient parallel processing. 

One caveat is that `bigmemory` creates matrices, which will handle only a single data type, unlike dataframes. By default, character columns in dataframes will be converted to factors and factors converted to numeric levels. The `ff` package discussed below may be a better solution if you must have multiple data types in the same object.

In [None]:
## Let's create a file-backed bigmatrix object using only 
## the first 3 columns of the snps dataframe
mem_used()
snps_bm = as.big.matrix(snps[,1:3], type="integer", backingfile="snps_bigmem.bin", backingpath="./data")
mem_used()

In [None]:
head(snps_bm)

In [None]:
## Use the mwhich() function to query the bigmatrix object
system.time({row = snps_bm[mwhich(snps_bm, c('chr','pos'), c(1, 225512846), c('eq', 'eq')), ]})
row

The performance for searching is pretty poor, but keep in mind that `bigmemory` was designed with matrices in mind, not tables of heterogeneous data. 
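
The `mwhich()` arguments are easier to follow on a tiny example. This sketch uses a made-up integer matrix and, unlike the cell above, no backing file, so by default the `big.matrix` is held in shared memory rather than on disk.

```r
library(bigmemory)

## Synthetic integer matrix for illustration only
m <- matrix(c(1L, 1L, 2L, 100L, 200L, 200L), ncol = 2,
            dimnames = list(NULL, c("chr", "pos")))

## Without a backingfile argument the big.matrix is not file-backed
bm <- as.big.matrix(m, type = "integer")

## mwhich() takes columns, values, and comparison operators in parallel;
## conditions are combined with AND by default
idx <- mwhich(bm, c("chr", "pos"), c(1, 200), c("eq", "eq"))
print(bm[idx, ])

## op = "OR" returns rows matching either condition
idx_or <- mwhich(bm, c("chr", "pos"), c(2, 100), c("eq", "eq"), op = "OR")
```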

## ff

Similar to `bigmemory`, the `ff` package allows for on-disk storage of large datasets with efficient data access and the ability to share the same data structure across multiple R processes.

In [None]:
getOption("fftempdir")

In [None]:
options("fftempdir"="/Users/mooneymi/Documents/BMI535/Lectures/data")

In [None]:
getOption("fftempdir")

In [None]:
## Let's create a ffdf object
mem_used()
snps_ff = read.delim.ffdf(file='./data/chr1_reducedCols.txt', header=FALSE, skip=7, sep='\t')
mem_used()

In [None]:
colnames(snps_ff) = c('rs','chr','pos','loci')

In [None]:
head(snps_ff)

In [None]:
## Use the ffwhich() function to query the data
system.time({row = snps_ff[ffwhich(snps_ff, chr==1 & pos==225512846 & loci=='DNAH14'), ]})
row
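
A small self-contained version of the same query pattern is sketched below, assuming both `ff` and `ffbase` are installed. The data are made up, and the text column is created as a factor because `ffdf` objects do not store plain character vectors.

```r
library(ff)
library(ffbase)

## Synthetic data for illustration only; the text column is stored as a factor
df <- data.frame(rs = 1:5, pos = c(100L, 200L, 300L, 400L, 500L),
                 loci = factor(c("A", "B", "C", "D", "E")))
demo_ff <- as.ffdf(df)

## Each column is an ff vector backed by a file on disk
print(filename(demo_ff$pos))

## ffwhich() evaluates the condition chunk by chunk and returns matching row numbers
idx <- ffwhich(demo_ff, pos == 300L & loci == "C")

## Select the matching rows and materialize them as an ordinary dataframe
row <- as.data.frame(demo_ff[idx, ])
print(row)
```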

## HDF5

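The `rhdf5` package provides an R interface to the HDF5 file format, so HDF5 files written by other tools (such as the pandas file saved in the previous lecture) can be listed and read from R.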

In [2]:
## Let's use the rhdf5 package to look at our previously saved HDF5 file
h5ls('./data/snps_pandas_hdf_zlib.h5')

| | group | name | otype | dclass | dim |
|---|---|---|---|---|---|
| 0 | / | snps | H5I_GROUP | | |
| 1 | /snps | _i_table | H5I_GROUP | | |
| 2 | /snps/_i_table | pos | H5I_GROUP | | |
| 3 | /snps/_i_table/pos | abounds | H5I_DATASET | FLOAT | 2720 |
| 4 | /snps/_i_table/pos | bounds | H5I_DATASET | FLOAT | 271 x 10 |
| 5 | /snps/_i_table/pos | indices | H5I_DATASET | INTEGER | 1114112 x 10 |
| 6 | /snps/_i_table/pos | indicesLR | H5I_DATASET | INTEGER | 1114112 |
| 7 | /snps/_i_table/pos | mbounds | H5I_DATASET | FLOAT | 2720 |
| 8 | /snps/_i_table/pos | mranges | H5I_DATASET | FLOAT | 10 |
| 9 | /snps/_i_table/pos | ranges | H5I_DATASET | FLOAT | 2 x 10 |


In [4]:
snps_hdf5 = h5read('./data/snps_pandas_hdf_zlib.h5', 'snps/table', bit64conversion='bit64')

In [None]:
snps_hdf5

In [None]:
H5close()
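
For experimenting without the pandas-generated file, the sketch below writes a small made-up dataset to a temporary HDF5 file and reads it back, including a partial read through the `index` argument. The group and dataset names (`demo`, `demo/pos`) are hypothetical.

```r
library(rhdf5)

## Write a small synthetic dataset to a temporary HDF5 file
h5_path <- tempfile(fileext = ".h5")
h5createFile(h5_path)
h5createGroup(h5_path, "demo")
h5write(c(100L, 200L, 300L, 400L), h5_path, "demo/pos")

## List the file contents (groups and datasets)
print(h5ls(h5_path))

## Read the whole dataset, then only elements 2 and 3 via the index argument
all_pos <- h5read(h5_path, "demo/pos")
some_pos <- h5read(h5_path, "demo/pos", index = list(2:3))

## Close any open HDF5 handles and remove the temporary file
H5close()
unlink(h5_path)
```

Reading with `index` avoids loading the full dataset into memory, which is the main reason to keep large arrays in HDF5 rather than in a flat text file.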

In [None]:
getwd()

## In-Class Exercises

## References

#### Last Updated: 11-Aug-2017