
Accelerating darch using MKL #18

Closed

lizhongz opened this issue Aug 17, 2016 · 4 comments

lizhongz commented Aug 17, 2016

Hi, my group wants to use darch for our DNN. Training takes very long, about 1.5 days for our use case, so we decided to speed it up by exploiting Intel MKL, which can automatically offload some computations to our Xeon Phi coprocessors.

I have recompiled R with MKL and linked it against MKL's BLAS and LAPACK. MKL is able to offload computations to the Xeon Phi for operations like matrix multiplication. However, MKL's automatic offloading does not happen when darch is running. I was wondering whether darch uses R's default BLAS and LAPACK (in this case, MKL's BLAS and LAPACK) or its own implementation. If not, is there a way to exploit MKL and the Xeon Phi?
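
For reference, this is the minimal sanity check I use (nothing darch-specific, just base R; the matrix size is arbitrary). A large dense matrix product goes through R's linked BLAS, so with MKL it should be fast and, with automatic offloading enabled, show activity on the coprocessor:

a <- matrix(rnorm(4000 * 4000), 4000, 4000)  # large dense matrix
system.time(a %*% a)  # %*% dispatches to the BLAS that R is linked against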

Thanks,
-- Lizhong

@saviola777
Collaborator

Hello,

As detailed here, MKL support is working on my test machine when gputools has been (left) disabled. darch uses R's default implementations for matrix multiplication in most cases, but some algorithms have been rewritten in C++. That provides a speedup on single-core systems and may cause a slowdown when using MKL, but definitely not to the degree that "automatic offloading does not happen". Maybe I should provide parameters to disable these C++ implementations.

Please provide more details about the parameters and dataset used to run darch so that I may reproduce the MKL issue. What behavior do you see when using darch 0.10?

@lizhongz
Author

lizhongz commented Aug 17, 2016

@saviola777 Thanks for your quick reply. We are running darch 0.12.0, and gputools is not installed.

Session info

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] doRNG_1.6 rngtools_1.2.4 pkgmaker_0.22 registry_0.3 foreach_1.4.3
[6] darch_0.12.0

darch DNN command

darchModel_50_10_1 <- darch(llrRankedTrain[, 1:50], llrRankedTrain$target, layers = c(50, 50, 11, 2), darch.unitFunction = exponentialLinearUnit)

Input data

The input data size is about 1,000,000 x 50
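
Our dataset is not public, but a synthetic stand-in of the same shape should work for reproduction (the llrRankedTrain generated below is a hypothetical placeholder, not our real data):

set.seed(1)
n <- 1e6  # ~1,000,000 rows; reduce if memory is tight
llrRankedTrain <- as.data.frame(matrix(rnorm(n * 50), n, 50))  # 50 numeric features
llrRankedTrain$target <- sample(c(0, 1), n, replace = TRUE)    # binary target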

Is version 0.10 the version without the single-core C++ optimizations?

@saviola777
Collaborator

Thanks for the feedback. I think the C++ implementation of the unit functions (more specifically of the ELU) is to blame for the lack of multi-threading in this case. I will have to investigate how I can make use of multi-threading from within the C++ code, but I'm afraid that it's going to be non-trivial (also considering that I'm not very experienced when it comes to writing C++ code).

Version 0.10 does not include the C++ optimizations, but it also lacks many of the newer features (e.g., it does not support ELU) and contains a number of bugs and problems that were fixed in 0.12. You can of course add your own unit functions dynamically in 0.10 if you want, as sketched below.
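
For illustration, a pure-R ELU could look something like the following. This is an unverified sketch; it assumes darch's unit-function convention of receiving the layer input matrix and returning a list of activations and first derivatives, so check the documentation for the exact interface before using it:

exponentialLinearUnitR <- function(input, alpha = 1, ...) {
  # ELU: x for x > 0, alpha * (exp(x) - 1) otherwise
  activations <- ifelse(input > 0, input, alpha * (exp(input) - 1))
  # Derivative: 1 for x > 0, alpha * exp(x) = activations + alpha otherwise
  derivatives <- ifelse(input > 0, 1, activations + alpha)
  list(activations, derivatives)
}

Passing such a function instead of the built-in one would bypass the C++ implementation entirely.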

There are two possible solutions to this problem:

  • introduce a switch to disable the C++ implementations of the unit functions, which would be simple
  • make use of multi-threading from within the C++ code, which would probably be the best solution

I can't promise an update with a fix on CRAN for a while, and I'm not sure when I'll get around to fixing this, but I will try to implement the first solution within the next few weeks so that you can check whether it solves the problem.

@saviola777
Collaborator

Just a couple of… weeks later, this should finally be fixed: I moved most of the C++ functions to RcppParallel, so you should see a significant speedup.
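
If you want to control how many threads RcppParallel uses, you can set that from R before calling darch; setThreadOptions() is RcppParallel's documented control, and by default all available cores are used. A minimal sketch:

library(RcppParallel)
# Cap the worker threads used by RcppParallel-backed code (default: all cores)
setThreadOptions(numThreads = 4)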
