
Implementing future.apply::future_*apply for massively large models via remote parallelisation #136

Closed
wants to merge 7 commits

Conversation

seonghobae
Contributor

Implementing the "future_*apply" API for massively large model via remote parallelisation.

  • Purpose: speed up model calibration and evaluation for the functions that use myLapply() and mySapply() internally, especially when the model is very large (a minimal sketch of the idea follows this list). myApply() will be updated soon.
  • itemfit(), mdirt(), DIF(), DTF(), M2(), PLCI.mirt(), lagrange(), boot.LR(), and https://github.com/philchalmers/mirt/blob/master/R/03-estimation.R#L729-L740 may run faster or use memory more efficiently when working with massively large models on remote clusters.
  • If future::future() supports the MPI interface someday (see "How to implement resolved() for an MPI-based cluster?", HenrikBengtsson/future#130), mirt() may be able to run its apply()-related functions on a supercomputer cluster.
  • Speed depends on network bandwidth, but future() will run in a multiprocess manner within each remote worker, automatically detecting the number of available cores on each machine of the heterogeneous cluster.
  • This pull request may be useful to researchers who calibrate models on Virtual Private Servers (VPS) from providers such as Amazon Web Services, DigitalOcean, and Vultr.
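
A minimal sketch of the core idea, not this PR's actual code: future.apply::future_lapply() as a drop-in replacement for lapply() on a cluster of SSH-reachable workers. The worker hostnames here are hypothetical, and myLapply()/mySapply() are mirt internals that this sketch only stands in for.

library(future)
library(future.apply)

# hypothetical SSH-reachable hosts, analogous to the backend behind myLapply()
workers <- c("mpiuser@s1", "mpiuser@s2")
plan(cluster, workers = workers)

# future_lapply() has the same interface as lapply(); each element is
# evaluated on whichever remote worker becomes available
res <- future_lapply(1:8, function(i) sqrt(i))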

Demo run

# local (One core, Xeon E5-2660, using MKL)
> system.time(mod_local <- mirt::mirt(mirt::Science, 4, SE = T, method = 'MHRM'))
Stage 3 = 311, LL = -3333.5, AR(1.10) = [0.21], gam = 0.0024, Max-Change = 0.0005

Calculating information matrix...

Calculating log-likelihood...
 User  System elapsed 
 32.897   0.318  26.639 

# local (4 cores, Xeon E5-2660, using MKL)
> suppressWarnings(suppressMessages(mirt::mirtCluster()))
> system.time(mod_local_parallel <- mirt::mirt(mirt::Science, 4, SE = T, method = 'MHRM'))
Stage 3 = 311, LL = -3333.5, AR(1.10) = [0.21], gam = 0.0024, Max-Change = 0.0005

Calculating information matrix...

Calculating log-likelihood...
 User  System elapsed 
 26.404   0.310  22.438 
> suppressWarnings(mirt::mirtCluster(remove = TRUE))

# relying on heterogeneous remotes via SSH; localhost will not use memory in the log-likelihood calculation.
> getOption('kaefaServers')
 [1] "mpiuser@s1"  "mpiuser@s2"  "mpiuser@s3"  "mpiuser@s4"  "mpiuser@s5"  "mpiuser@s6"  "mpiuser@s7" 
 [8] "mpiuser@s8"  "mpiuser@s9"  "mpiuser@s10" "mpiuser@s11" "mpiuser@b1"  "mpiuser@b2"  "mpiuser@b3" 
[15] "mpiuser@b4" 
> suppressWarnings(suppressMessages(mirt::mirtCluster(getOption('kaefaServers')))) # using parallel::par*apply
> system.time(mod_remote_parallel_traditional <- mirt::mirt(mirt::Science, 4, SE = T, method = 'MHRM'))
Stage 3 = 311, LL = -3333.5, AR(1.10) = [0.21], gam = 0.0024, Max-Change = 0.0005

Calculating information matrix...

Calculating log-likelihood...
 User  System elapsed 
 26.801   0.996  51.924 
> suppressWarnings(mirt::mirtCluster(remove = TRUE))
> suppressWarnings(suppressMessages(mirt::mirtCluster(getOption('kaefaServers'), use_future = TRUE))) # using future API
> system.time(mod_remote_parallel_futureapi <- mirt::mirt(mirt::Science, 4, SE = T, method = 'MHRM'))
Stage 3 = 311, LL = -3333.5, AR(1.10) = [0.21], gam = 0.0024, Max-Change = 0.0005

Calculating information matrix...

Calculating log-likelihood...
 User  System elapsed 
 27.305   0.888  44.218 
  • The demo runs slower on the remotes than with local parallelism at 100 Mbps bandwidth, but the future API reduced calibration time by about 7.706 seconds compared with the parallel package (51.924 s vs. 44.218 s elapsed). This difference comes from the multiprocess strategy within each connection, via automatic detection of the number of cores (see the sketch below). With more bandwidth, the speed-up should grow (e.g. InfiniBand: https://en.wikipedia.org/wiki/InfiniBand).
  • Once I implement future_*apply(), remote calibration should be faster than it is now.
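
A minimal sketch of the nested topology described above, assuming the same hypothetical host list as the demo: an outer cluster level with one worker per remote host, and an inner multiprocess level so that the core count is detected automatically on each machine.

library(future)

servers <- getOption('kaefaServers')  # as in the demo above

# outer level: one cluster worker per remote host;
# inner level: multiprocess within each host, so availableCores()
# is resolved separately on every (possibly heterogeneous) machine
plan(list(
  tweak(cluster, workers = servers),
  multiprocess
))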

Please feel free to review my request.

Best,
Seongho

@philchalmers
Owner

I'll consider this, but the case for the merge doesn't seem that great, in that it's almost overkill. The parallel processing may work better under the future framework, but I don't really see the benefit over the current parallel package scheme, even with physical InfiniBand support. I think a case needs to be made that it will actually improve performance; otherwise, maintaining such a codebase isn't really in my interest.

@seonghobae
Contributor Author

Yes, this commit may seem like overkill for the speed-up. After implementing the future_*apply functions, I may commit parallelised C++ code (RcppParallel) and GPU matrix calculations (gpuR) for the parallelised cluster, i.e. a heterogeneous computing environment. This commit is just the starting point for the performance improvements. I have understood the calibration speed issue around mirt() for a long time. I'll keep updating these commits steadily. I would like to hear your opinion.

Future Plan

  • Parallelising the calculations across machines using the future API with automatic load balancing. (Current)
  • Parallelising the C++ code within each machine for the base R environment, where Intel MKL is not available.
  • Applying GPU matrix estimations in some parts if a GPU is detected (a sketch follows below).

※ I'm a user of https://www.top500.org/system/177987 now.
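
A minimal sketch (not part of this PR) of the gpuR item in the plan above: detect a GPU at run time and fall back to plain BLAS when none is found. The matrix here is arbitrary demo data.

library(gpuR)

X <- matrix(rnorm(1000 * 100), nrow = 1000)

if (detectGPUs() > 0) {
  gX <- gpuMatrix(X, type = "float")  # copy the matrix into GPU memory
  XtX <- crossprod(gX)                # X'X computed on the GPU via OpenCL
} else {
  XtX <- crossprod(X)                 # ordinary BLAS fallback
}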

@seonghobae
Contributor Author

I expect future_*apply will open a way to parallelise the E step and M step both across and within machines, with some code refactoring. That's why I committed future_*apply first, as a heads-up; a sketch of the idea follows below.

(This work may require modifying some of the calibration code, even though gpuR is a wrapper for OpenCL support in the R environment!)
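
A minimal sketch of that idea, with a stand-in computation rather than mirt's actual E-step: under the future API, the same future_lapply() call runs within one machine or across machines depending only on the chosen plan(), so refactored E/M-step code would not need separate inter- and intra-machine branches.

library(future.apply)

plan(multisession)  # intra-machine
# plan(cluster, workers = c("mpiuser@s1", "mpiuser@s2"))  # inter-machine

# hypothetical per-block computation standing in for an E-step chunk
partial <- future_lapply(1:4, function(block) {
  sum(dnorm(rnorm(1e5), log = TRUE))
})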
@seonghobae
Contributor Author

seonghobae commented Feb 21, 2018

I want to implement OpenCL support via future_*apply: less effort, but maximising efficiency and speed. So I made a placeholder for OpenCL support via mirtCluster().

Short documentation of OpenCL support in R:
https://secure.hosting.vt.edu/www.arc.vt.edu/wp-content/uploads/2017/04/GPU_R_Workshop_2017_slidy.html
https://rpubs.com/christoph_euler/gpuR_examples
http://bobby.gramacy.com/teaching/asc/gpu_tutorial.html

@philchalmers
Owner

Parallelizing the E- and M-steps would take a fairly large amount of code rewriting. As it stands, the E-step could be partially rewritten to use OpenMP, but its computations are not strictly parallel (at least not embarrassingly so) because the expectation table must be shared across nodes, causing write-permission conflicts. I experimented with this idea a while back, but gave up on it when the performance did not improve due to all the #pragma flags. The M-step in its current form also could never be run in parallel, nor could the required derivative computations, simply because R's object types are not typesafe.

These are, what some would consider, the major limitations of mirt, which I will never change. If I get a sabbatical one year, maybe a performance-inspired mirt2 could be created instead... but it couldn't have all the features that are only possible because of R's interpretive interface.

@seonghobae
Contributor Author

Yes, I saw the parts you noted, and I agree with you. At the moment it seems too hard to improve the C++ level through refactoring. I already know #pragma may not be useful in some situations and will not work well. Also, outside of the C++ parts, some functions require sequential derivative computations in for() loops, as far as I can see. However, while implementing the future API during this work, I found I may be able to parallelise some for() loops in the R code (a sketch follows below). Moreover, GPU wrappers for solve(), crossprod(), and so on may help to improve calibration speed immediately. (I know that is not at the C++ level, so it may not be helpful to Windows users.) I'm reading the code more and will then update.
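
A minimal sketch of the kind of R-level for() loop that could be parallelised this way; pars and gradient_fn are hypothetical stand-ins, not mirt internals. Only loops whose iterations are independent qualify.

library(future.apply)
plan(multisession)

pars <- replicate(8, rnorm(5), simplify = FALSE)
gradient_fn <- function(p) 2 * p  # analytic gradient of sum(p^2), as a stand-in

# sequential original:
# grads <- vector("list", length(pars))
# for (i in seq_along(pars)) grads[[i]] <- gradient_fn(pars[[i]])

# future-based replacement for independent iterations:
grads <- future_lapply(pars, gradient_fn)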

@seonghobae
Contributor Author

How about implementing this at the C++ level? http://viennacl.sourceforge.net/viennacl-about.html
An R wrapper function for ViennaCL is available.

@philchalmers
Owner

I'm going to close this for now, as I just don't see the benefit to it in this package. Something like this should be applied to a fork of mirt, though, or better yet a complete re-write of the package intended for performance (mirt2, maybe?).
