jaredhuling/rfunctions

a set of R functions that I find useful
C++ R C
Switch branches/tags
Nothing to show
Fetching latest commit…
Cannot retrieve the latest commit at this time.

rfunctions

some cool R functions that I either wrote, stole, or modified

Installation

rfunctions is not available on CRAN, but can be installed using the R package devtools. rfunctions can be installed with the following R code:

```devtools::install_github("jaredhuling/rfunctions")
library(rfunctions)```

Accelerated crossprod function

A project I've been working on requires fast evaluation of X'X for a design matrix X. I found a great example in the paper for RcppEigen by Douglas Bates and Dirk Eddelbuettel for just such a thing. RcppEigen provides a simple and effective interface between R and the blazing-fast Eigen C++ library for numerical linear algebra. Their example uses inline, a nice tool for inline C++ code in R, and I a made a proper R function from that. The following showcases the speed of Eigen. Note that since X'X is symmetric, we only have to compute half of the values, which further reduces computation time.

```n.obs <- 10000
n.vars <- 100

x <- matrix(rnorm(n.obs * n.vars), n.obs, n.vars)

library(microbenchmark)

microbenchmark(crossprodcpp(x), crossprod(x), times = 25L)```
``````## Unit: milliseconds
##             expr   min    lq median    uq   max neval
##  crossprodcpp(x) 11.88 13.09  15.33 16.82 19.34    25
##     crossprod(x) 51.32 56.07  60.98 62.41 72.67    25
``````
`all.equal(crossprodcpp(x), crossprod(x))`
``````## [1] TRUE
``````

`crossprodcpp` can also compute a weighted cross product X'WX where W is a diagonal weight matrix

```ru <- runif(n.obs)
weights <- ru * (1 - ru)

microbenchmark(crossprodcpp(x, weights), crossprod(x, weights * x), times = 25L)```
``````## Unit: milliseconds
##                       expr    min     lq median    uq    max neval
##   crossprodcpp(x, weights)  15.93  16.34  16.68  17.2  24.61    25
##  crossprod(x, weights * x) 105.36 106.59 110.08 113.9 134.62    25
``````
`all.equal(crossprodcpp(x, weights), crossprod(x, weights * x))`
``````## [1] TRUE
``````

Largest Singular Value Computation

The Lanczos algorithm is a well-known method for fast computation of extremal eigenvalues. The Golub-Kahan-Lanczos bidiagonalization algorithm is an extension of this to approximate the largest singular values of a matrix X from below. The function `gklBidiag` approximates the largest singular value of a matrix. Since GKL bidiagonalization is initialized from a random vector, we can compute a probabilistic upper bound for the singular value. The following compares the speed of `gklBidiag` and the implementation in the popular Fortran library PROPACK found in the svd package

```library(svd)

v <- runif(ncol(x))  #initialization for GKL-bidiag
opts <- list(kmax = 30L)
microbenchmark(gklBidiag(x, v, maxit = 30L), propack.svd(x, neig = 1L, opts = opts))```
``````## Unit: milliseconds
##                                    expr    min     lq median     uq    max
##            gklBidiag(x, v, maxit = 30L)  31.61  32.08  32.43  35.12  67.78
##  propack.svd(x, neig = 1L, opts = opts) 230.24 234.09 237.05 242.29 329.95
##  neval
##    100
##    100
``````
`gklBidiag(x, v, maxit = 30L)\$d - propack.svd(x, neig = 1L, opts = opts)\$d`
``````## [1] -2.52e-11
``````

As `gklBidiag` also works on sparse matrices (of the `SparseMatrix` class from the Matrix package), I can showcase another function in rfunctions, `simSparseMatrix`, which unsurprisingly simulates matrices with very few nonzero values. The nonzero values can either be all 1's or generated from a normal distribution. The level of sparsity of the simulated matrix can be specified

```n.obs <- 1e+05
n.vars <- 1000

# simulate a very sparse matrix (this matrix has many zeros and few ones)
x.s.b <- simSparseMatrix(sparsity = 0.99, dim = c(n.obs, n.vars), boolean = T)
x.s.c <- simSparseMatrix(sparsity = 0.99, dim = c(n.obs, n.vars), boolean = F)
v <- runif(n.vars)

# reorthogonalization sometimes leads to higher accuracy. it helps correct
# for floating-point errors
microbenchmark(gklBidiag(x.s.b, v, maxit = 10L, 0L), gklBidiag(x.s.c, v, maxit = 10L,
0L))```
``````## Unit: milliseconds
##                                  expr   min    lq median     uq   max
##  gklBidiag(x.s.b, v, maxit = 10L, 0L) 82.95 84.54  85.59  92.48 246.8
##  gklBidiag(x.s.c, v, maxit = 10L, 0L) 83.21 85.00  87.00 101.99 243.1
##  neval
##    100
##    100
``````
`gklBidiag(x.s.b, v, maxit = 10L, 0L)\$d`
``````## [1] 104.9
``````
`gklBidiag(x.s.c, v, maxit = 10L, 0L)\$d`
``````## [1] 35.29
``````

Fast Moore-Penrose Generalized Inverse

The speed of `MASS::ginv()` leaves much to be desired, as it calls `svd()` in order to compute the Moore-Penrose generalized inverse of a matrix. I came across this paper, which provides a faster algorithm for the M-P inverse. I've implemented it using RcppEigen. This is useful for least squares problems when the matrix is less than full rank, like below

```n <- 1000
p <- 500

x <- matrix(rnorm(n * (p - 1)), n, p - 1)
x <- cbind(x, rowMeans(x))

## compute X'X
xpx <- crossprodcpp(x)

library(MASS)

## compute generalized inverse of X'X
microbenchmark(geninv(xpx), ginv(xpx))```
``````## Unit: milliseconds
##         expr   min    lq median    uq   max neval
##  geninv(xpx) 185.9 190.4  192.1 199.6 256.2   100
##    ginv(xpx) 498.7 511.6  522.2 534.7 711.0   100
``````
```inv <- geninv(xpx)
inv2 <- ginv(xpx)

## check if we have computed the M-P generalized inverse
all.equal(xpx, xpx %*% inv %*% xpx)```
``````## [1] TRUE
``````
`all.equal(inv, inv2)`
``````## [1] TRUE
``````

This may seem pointless, but I wrote functions to add and subtract matrices. It turns out my functions are faster than using the `+` and `-` operators. I'm sure someone will be quick to point out why using my `add()` and `subtract()` functions is silly and a bad idea.

```n.obs <- 1e+05
n.vars <- 500

A <- simSparseMatrix(sparsity = 0.99, dim = c(n.obs, n.vars), boolean = F)
B <- simSparseMatrix(sparsity = 0.99, dim = c(n.obs, n.vars), boolean = F)

microbenchmark(add(A, B), A + B, times = 25L)```
``````## Unit: milliseconds
##       expr   min     lq median     uq   max neval
##  add(A, B) 37.59  40.08  40.93  44.32 161.2    25
##      A + B 92.69 100.69 105.72 110.32 218.2    25
``````
`microbenchmark(subtract(A, B), A - B, times = 25L)`
``````## Unit: milliseconds
##            expr   min    lq median    uq   max neval
##  subtract(A, B) 37.42 39.22  40.22  45.8 155.2    25
##           A - B 94.09 97.81 108.23 112.1 120.9    25
``````
`all.equal(add(A, B), A + B)`
``````## [1] TRUE
``````
`all.equal(subtract(A, B), A - B)`
``````## [1] TRUE
``````

The `add()` and `subtract()` methods for dense matrices are slower than the corresponding operators, so they're only worth using when you have sparse matrices.

```n.obs <- 1000
n.vars <- 1000

A <- matrix(rnorm(n.obs * n.vars), n.obs, n.vars)
B <- matrix(rnorm(n.obs * n.vars), n.obs, n.vars)

``````## Unit: milliseconds
##       expr   min    lq median    uq   max neval
##  add(A, B) 4.801 6.286  6.432 7.201 24.76   100
##      A + B 1.677 3.018  3.207 3.472 19.32   100
``````
`microbenchmark(subtract(A, B), A - B)`
``````## Unit: milliseconds
##            expr   min    lq median    uq   max neval
##  subtract(A, B) 4.838 6.260  6.347 6.512 19.38   100
##           A - B 1.845 2.996  3.054 3.198 20.56   100
``````
`all.equal(add(A, B), A + B)`
``````## [1] TRUE
``````
`all.equal(subtract(A, B), A - B)`
``````## [1] TRUE
``````