-
Notifications
You must be signed in to change notification settings - Fork 9
/
mvnfast.Rmd
237 lines (184 loc) · 9.38 KB
/
mvnfast.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
---
title: "An Introduction to mvnfast"
author: "Matteo Fasiolo"
date: "`r Sys.Date()`"
vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{mvnfast_vignette}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(out.extra='style="display:block; margin: auto"', fig.align="center", tidy=FALSE)
```
Introduction
------------
The `mvnfast` R package provides computationally efficient tools related to the multivariate normal and Student's t distributions.
The tools are generally faster than those provided by other packages, thanks to the use of C++ code through the
`Rcpp`\\`RcppArmadillo` packages and parallelization through the `OpenMP` API. The most important functions are:
- `rmvn()`: simulates multivariate normal random vectors.
- `rmvt()`: simulates Student's t normal random vectors.
- `dmvn()`: evaluates the probability density function of a multivariate normal distribution.
- `dmvt()`: evaluates the probability density function of a multivariate Student's t distribution.
- `maha()`: evaluates mahalanobis distances.
In the following sections we will benchmark each function against equivalent functions provided by other packages, while in the final section we provide an example application.
Simulating multivariate normal or Student's t random vectors
----------------------------
Simulating multivariate normal random variables is an essential step in many Monte Carlo algorithms (such as MCMC or Particle Filters),
hence this operations has to be as fast as possible. Here we compare the `rmvn` function with the equivalent function `rmvnorm`
(from the `mvtnorm` package) and `mvrnorm` (from the `MASS` package). In particular, we simulate $10^4$ twenty-dimensional random vectors:
```{r pack, message=F, warning=F}
# microbenchmark does not work on all platforms, hence we need this small wrapper
microwrapper <- function(..., times = 100L){
ok <- "microbenchmark" %in% rownames(installed.packages())
if( ok ){
library("microbenchmark")
microbenchmark(list = match.call(expand.dots = FALSE)$..., times = times)
}else{
message("microbenchmark package is not installed")
return( invisible(NULL) )
}
}
library("mvtnorm")
library("mvnfast")
library("MASS")
# We might also need to turn off BLAS parallelism
library("RhpcBLASctl")
blas_set_num_threads(1)
N <- 10000
d <- 20
# Creating mean and covariance matrix
mu <- 1:d
tmp <- matrix(rnorm(d^2), d, d)
mcov <- tcrossprod(tmp, tmp)
microwrapper(rmvn(N, mu, mcov, ncores = 2),
rmvn(N, mu, mcov),
rmvnorm(N, mu, mcov),
mvrnorm(N, mu, mcov))
```
In this example `rmvn` cuts the computational time, relative to the alternatives, even when a single core is used. This gain is attributable to several factors: the use of C++ code and efficient numerical algorithms to simulate the random variables. Parallelizing the computation on two cores gives another appreciable speed-up. To be fair, it is necessary to point out that `rmvnorm` and `mvrnorm` have many more safety check on the user's input than `rmvn`. This is true also for the functions described in the next sections.
Notice that this function does not use one of the Random Number Generators (RNGs) provided by R, but one of the parallel cryptographic RNGs described in (Salmon et al., 2011). It is important to point out that this RNG can safely be used in parallel, without risk of collisions between parallel sequence of random numbers, as detailed in the above reference.
We get similar performance gains when we simulate multivariate Student's t random variables:
```{r rmvt}
# Here we have a conflict between namespaces
microwrapper(mvnfast::rmvt(N, mu, mcov, df = 3, ncores = 2),
mvnfast::rmvt(N, mu, mcov, df = 3),
mvtnorm::rmvt(N, delta = mu, sigma = mcov, df = 3))
```
When `d` and `N` are large, and `rmvn` or `rmvt` are called several times with the same arguments, it would make sense to create the matrix where to store the simulated random variable upfront. This can be done as follows:
```{r rmvnA}
A <- matrix(nrow = N, ncol = d)
class(A) <- "numeric" # This is important. We need the elements of A to be of class "numeric".
rmvn(N, mu, mcov, A = A)
```
Notice that here `rmvn` returns `NULL`, not the simulated random vectors! These can be found in the matrix provided by the user:
```{r rmvnA1}
A[1:2, 1:5]
```
Pre-creating the matrix of random variables saves some more time:
```{r rmvnA2}
microwrapper(rmvn(N, mu, mcov, ncores = 2, A = A),
rmvn(N, mu, mcov, ncores = 2),
times = 200)
```
Don't look at the median time here, the mean is much more affected by memory re-allocation.
Evaluating the multivariate normal and Student's t densities
----------------------------
Here we compare the `dmvn` function, which evaluates the multivariate normal density, with the equivalent function `dmvtnorm` (from the `mvtnorm` package).
In particular we evaluate the log-density of $10^4$ twenty-dimensional random vectors:
```{r dmvn}
# Generating random vectors
N <- 10000
d <- 20
mu <- 1:d
tmp <- matrix(rnorm(d^2), d, d)
mcov <- tcrossprod(tmp, tmp)
X <- rmvn(N, mu, mcov)
microwrapper(dmvn(X, mu, mcov, ncores = 2, log = T),
dmvn(X, mu, mcov, log = T),
dmvnorm(X, mu, mcov, log = T), times = 500)
```
Again, we get some speed-up using C++ code and some more from the parallelization. We get similar results if we use a multivariate Student's t density:
```{r dmvt}
# We have a namespace conflict
microwrapper(mvnfast::dmvt(X, mu, mcov, df = 4, ncores = 2, log = T),
mvnfast::dmvt(X, mu, mcov, df = 4, log = T),
mvtnorm::dmvt(X, delta = mu, sigma = mcov, df = 4, log = T), times = 500)
```
Evaluating the Mahalanobis distance
----------------------------
Finally, we compare the `maha` function, which evaluates the square [mahalanobis distance](https://en.wikipedia.org/wiki/Mahalanobis_distance) with the equivalent function `mahalanobis` (from the `stats` package).
Also in the case we use $10^4$ twenty-dimensional random vectors:
```{r maha}
# Generating random vectors
N <- 10000
d <- 20
mu <- 1:d
tmp <- matrix(rnorm(d^2), d, d)
mcov <- tcrossprod(tmp, tmp)
X <- rmvn(N, mu, mcov)
microwrapper(maha(X, mu, mcov, ncores = 2),
maha(X, mu, mcov),
mahalanobis(X, mu, mcov))
```
The acceleration is similar to that obtained in the previous sections.
Example: mean-shift mode seeking algorithm
----------------------------
As an example application of the `dmvn` function, we implemented the [mean-shift mode seeking](https://en.wikipedia.org/wiki/Mean-shift) algorithm.
This procedure can be used to find the mode or maxima of a kernel density function, and it can be used to set up
clustering algorithms. Here we simulate $10^4$ d-dimensional random vectors from mixture of normal distributions:
```{r mixSim}
set.seed(5135)
N <- 10000
d <- 2
mu1 <- c(0, 0); mu2 <- c(2, 3)
Cov1 <- matrix(c(1, 0, 0, 2), 2, 2)
Cov2 <- matrix(c(1, -0.9, -0.9, 1), 2, 2)
bin <- rbinom(N, 1, 0.5)
X <- bin * rmvn(N, mu1, Cov1) + (!bin) * rmvn(N, mu2, Cov2)
```
Finally, we plot the resulting probability density and, starting from 10 initial points, we use mean-shift to converge to the nearest mode:
```{r mixPlot}
# Plotting
np <- 100
xvals <- seq(min(X[ , 1]), max(X[ , 1]), length.out = np)
yvals <- seq(min(X[ , 2]), max(X[ , 2]), length.out = np)
theGrid <- expand.grid(xvals, yvals)
theGrid <- as.matrix(theGrid)
dens <- dmixn(theGrid,
mu = rbind(mu1, mu2),
sigma = list(Cov1, Cov2),
w = rep(1, 2)/2)
plot(X[ , 1], X[ , 2], pch = '.', lwd = 0.01, col = 3)
contour(x = xvals, y = yvals, z = matrix(dens, np, np),
levels = c(0.002, 0.01, 0.02, 0.04, 0.08, 0.15 ), add = TRUE, lwd = 2)
# Mean-shift
library(plyr)
inits <- matrix(c(-2, 2, 0, 3, 4, 3, 2, 5, 2, -3, 2, 2, 0, 2, 3, 0, 0, -4, -2, 6),
10, 2, byrow = TRUE)
traj <- alply(inits,
1,
function(input)
ms(X = X,
init = input,
H = 0.05 * cov(X),
ncores = 2,
store = TRUE)$traj
)
invisible( lapply(traj,
function(input){
lines(input[ , 1], input[ , 2], col = 2, lwd = 1.5)
points(tail(input[ , 1]), tail(input[ , 2]))
}))
```
As we can see from the plot, each initial point leads one of two points that are very close to the true mode. Notice that the bandwidth for the kernel density estimator was chosen by trial-and-error, and less arbitrary choices are certainly possible in real applications.
References
----------------------------
* Dirk Eddelbuettel and Romain Francois (2011). Rcpp: Seamless R and C++ Integration. Journal of Statistical Software, 40(8),
1-18. URL https://www.jstatsoft.org/v40/i08/.
* Eddelbuettel, Dirk (2013) Seamless R and C++ Integration with Rcpp. Springer, New York. ISBN 978-1-4614-6867-7.
* Dirk Eddelbuettel, Conrad Sanderson (2014). RcppArmadillo: Accelerating R with high-performance C++ linear algebra. Computational
Statistics and Data Analysis, Volume 71, March 2014, pages 1054-1063. URL https://dx.doi.org/10.1016/j.csda.2013.02.005
* https://www.openmp.org/
* John K. Salmon, Mark A. Moraes, Ron O. Dror, and David E. Shaw (2011). Parallel Random Numbers: As Easy as 1, 2, 3.
D. E. Shaw Research, New York, NY 10036, USA.