
roll_lm gets slower with repeated iterations in the same R session #1

Closed
tieskevdh opened this issue Aug 22, 2016 · 8 comments

@tieskevdh

I find that computation time increases substantially when using 'roll_lm' repeatedly. This is true both for different data sets and for the same data set. I've attached an example together with the input data and the profvis output. It runs rolling-window regressions for a number of firms on annual data and repeats that analysis several times, in this case on the same data; in my actual application I have a series of simulated data sets on which I want to run this.

For a series of five iterations, the time taken by roll_lm on my machine is as follows:

[[1]]
   user  system elapsed
  15.79    5.17    6.22

[[2]]
   user  system elapsed
  41.27    8.44   10.58

[[3]]
   user  system elapsed
  71.13   14.08   15.65

[[4]]
   user  system elapsed
  96.97   18.94   20.40

[[5]]
   user  system elapsed
 135.19   26.80   27.25

Hence, by iteration five the total CPU time is about nine times that of the first iteration, even though it performs the same computations.

Restarting the R session appears to help, but beyond that nothing (e.g., rm()) seems to affect the performance degradation.

I didn't manage to attach a zip with the code and data, but the file in the link contains the relevant files, including session info; I'm using the MRAN version of R 3.3.0 with the MKL libraries.
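A minimal sketch of the pattern described above, timing the same roll_lm call several times within one session (the data here are simulated placeholders, not the attached firm data):

```r
# Time the identical roll_lm call five times within one R session;
# in the reported setup, elapsed time grows with each repetition.
library(roll)

set.seed(1)
x <- matrix(rnorm(1000 * 2), ncol = 2)  # two regressors
y <- matrix(rnorm(1000), ncol = 1)

times <- lapply(1:5, function(i) system.time(roll_lm(x, y, width = 5)))
print(sapply(times, `[[`, "elapsed"))
```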

@jasonjfoster
Owner

Thanks, it looks like you are already taking advantage of parallelism by using MKL for the math routines, and then layering more thread scheduling on top via the parallel algorithms used in the 'roll' package (through RcppParallel). This means the multithreaded linear algebra operations and the parallel "for" loops compete for the same processor cores and actually lose performance.

As two possible solutions, can you try reducing the number of threads to one for either MKL or RcppParallel:

  1. MKL: setMKLthreads(1) (see https://mran.microsoft.com/documents/rro/multithread/#mt-setget)
  2. RcppParallel: setThreadOptions(numThreads = 1) (see https://rcppcore.github.io/RcppParallel/#threads_used)

This is noted in ?roll_lm:

If users are already taking advantage of parallelism using multithreaded BLAS/LAPACK libraries, then limit the number of cores in the RcppParallel package to one with the setThreadOptions function.

Also, if you're interested in learning more, here are some additional references:

@tieskevdh
Author

Thanks, it looks like this is related to the RcppParallel package. If I switch both thread counts to one, it works fine without slowing down on repetition.

If I leave the MKL thread count at 1 and switch RcppParallel to the number of cores (6), the performance is much worse on every iteration and gets worse with repetitions of the roll_lm function.

If I then change the MKL thread count back to the number of cores (6), while leaving the RcppParallel thread count at 1, the performance is effectively the same as with both thread counts set to 1 (presumably because a regression with five observations offers little opportunity for multithreading to improve performance).

The attached text file has the details. Each time, I restart the R session before changing a thread count to make sure the results are not driven by existing code/variables.

thread_testing.txt

@jasonjfoster
Owner

Based on your results, when the number of threads is reduced to one for RcppParallel, it works as expected in both cases, but not when setMKLthreads(1) is used and setThreadOptions is set to the max (?) number of threads. In the latter case, my guess is the two are still competing for the same processor and (again) lose performance. Can you try a case where setMKLthreads(1) is fixed and setThreadOptions(numThreads) uses a value less than the max? Note that the max values for setMKLthreads and setThreadOptions are the number of cores and the number of threads, respectively, which can differ; see ?getMKLthreads and ?defaultNumThreads.

@tieskevdh
Author

In the attached file, I've fixed the number of MKL threads at 1, and vary the number of RcppParallel threads between different runs. Within each run the same roll_lm model is estimated five times in sequence. The R session is restarted between runs.

The machine has 6 physical cores (12 threads with hyperthreading), but when using more than 1 RcppParallel thread the performance is already lower than with a single thread. For 2 RcppParallel threads the performance is relatively constant across the 5 repetitions. For 3 or more RcppParallel threads, the performance gets progressively worse as we rerun the same roll_lm model, with the fifth iteration about 4 times slower than the first one when using 5 or 6 RcppParallel threads.
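The structure of one such run can be sketched roughly as follows (setMKLthreads is MRO-only; in the actual tests each thread count was used in a fresh R session, and the data here are illustrative):

```r
library(roll)
library(RcppParallel)
# setMKLthreads(1)  # fix MKL at one thread (MRO builds only)

set.seed(1)
x <- matrix(rnorm(1000 * 2), ncol = 2)
y <- matrix(rnorm(1000), ncol = 1)

setThreadOptions(numThreads = 3)  # varied from 1 to 6, one value per session
elapsed <- sapply(1:5, function(i) {
  system.time(roll_lm(x, y, width = 5))["elapsed"]
})
print(round(elapsed, 2))
```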

So I understand that the overhead of RcppParallel may be too much for this small task, which explains why the performance is worse for 2 RcppParallel threads than for a single thread. However, neither that nor resource contention explains why the performance gets worse when I rerun the same model multiple times, or am I missing something?

thread_testing_2.txt

@jasonjfoster
Owner

Thanks, I think this is an important question. Let's first try to determine whether it's caused by 'roll' or something else like 'MRO', 'RcppParallel', etc. (in which case, we can ask the experts). For example, after a quick search, there is a similar question on the MRO forum that asked about why the "MKL multithreaded library and mclapply do not play well together": https://groups.google.com/forum/#!topic/rropen/MG98lXsFepo (notice that the times in the setMKLthreads(1) cases also degrade -- may be worth a follow-up question or post later).

I put together the following reproducible example, which will help others follow along more easily. It's a basic RcppParallel function that uses the math libraries via RcppArmadillo to compute the rolling matrix inverse (sourced from the 20160823.cpp file below). Can you run it, vary the thread count, and let me know the result:

library(Rcpp)
library(RcppParallel)
sourceCpp("20160823.cpp")  # defines parallelMatrixInv()

n_vars <- 150
n_obs <- 1000

# Array of n_obs random square matrices (n_vars x n_vars each)
set.seed(1)
x <- array(rnorm(n_obs * n_vars * n_vars), dim = c(n_vars, n_vars, n_obs))

system.time(parallelMatrixInv(x))
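For comparison, a serial base-R baseline over the same array can be sketched as follows (no RcppParallel involved; timings will of course differ by machine):

```r
# Serial baseline: invert each slice of the array with base R's solve().
set.seed(1)
n_vars <- 150
n_obs <- 1000
x <- array(rnorm(n_obs * n_vars * n_vars), dim = c(n_vars, n_vars, n_obs))

inv_serial <- function(x) {
  out <- array(NA_real_, dim = dim(x))
  for (i in seq_len(dim(x)[3])) {
    out[, , i] <- solve(x[, , i])  # one n_vars x n_vars inversion per slice
  }
  out
}

system.time(x_inv <- inv_serial(x))
```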

@tieskevdh
Author

tieskevdh commented Aug 24, 2016

OK, I've run your code, using MRO 3.3.0 with either 1 or 6 MKL threads, and also stock R 3.3.1 with the standard BLAS library. In all cases, I vary the number of RcppParallel threads from 1 to 6. For each setting, I run the matrix inversion (on the same input) five times.

Output in the attached files. Takeaways:

  • Standard BLAS is slightly faster than MKL for this operation.
  • MKL and RcppParallel threads don't seem to interfere with each other, as performance is quite similar between 1 and 6 MKL threads for a given number of RcppParallel threads.
  • There is no performance degradation across the five repeats for any given setting, i.e., combination of MKL threads and RcppParallel threads.

The latter result suggests to me that something in roll causes the interplay between MKL and RcppParallel threads to break down, and increasingly so when repeating the same analysis within a single R session.

thread_testing_matrixinverse_noMKL.txt
thread_testing_matrixinverse_MKL1.txt
thread_testing_matrixinverse_MKL6.txt

@jasonjfoster
Owner

Thanks, it may also mean I need to modify the reproducible example if multithreading has no impact. In the meantime, can you contact me directly in case I have minor follow-up questions? Then we can report back here with the conclusion.

@jasonjfoster
Owner

This issue was resolved by modifying the Makevars files and is now fixed in the development version:

# install.packages("devtools")
devtools::install_github("jjf234/roll")
