"case.weights" take very long #54

Closed
mayer79 opened this issue Apr 25, 2016 · 7 comments

@mayer79

mayer79 commented Apr 25, 2016

The built-in option of using case weights when drawing the bootstrap sample is very important in practice. However, I noticed a dramatic increase in runtime when using it. In the example below, fitting with case weights takes about ten times as long as fitting without them. Is this expected?

library(ranger)

n <- 10000

set.seed(4)
y <- rnorm(n)
x <- rnorm(n)
w <- runif(n)

# No case weights: User 9.96, System 0.04 on an 8 GB RAM Windows laptop
system.time(fit.1 <- ranger(y ~ x)) 

# Uniform case weights: User 114.69, System 0.12
system.time(fit.2 <- ranger(y ~ x, case.weights = w)) 

# Equal case weights: User 112.36, System 0.11
system.time(fit.3 <- ranger(y ~ x, case.weights = rep(1, times = n))) 

@mnwright
Member

No, this is not expected. I can reproduce the issue on Windows but not on Mac or Linux. I will check the code for Windows-specific problems.

@mnwright
Member

The problem seems to be std::discrete_distribution<> with gcc 4.6.3. I tried with the new 4.9.3 toolchain and R-devel and it was fast.

Any idea how to solve this instead of waiting for a newer gcc?
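
For readers unfamiliar with the class in question, here is a minimal sketch of how `std::discrete_distribution` can draw a weighted bootstrap sample. It is not ranger's actual code: the helper name `weighted_bootstrap` and its parameters are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of ranger): draw `n_draws` indices with
// replacement, where index i is chosen with probability
// weights[i] / sum(weights).
std::vector<std::size_t> weighted_bootstrap(const std::vector<double>& weights,
                                            std::size_t n_draws,
                                            unsigned seed) {
  assert(!weights.empty());
  std::mt19937_64 rng(seed);
  // Build the distribution once from the weights and reuse it for every draw.
  std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
  std::vector<std::size_t> sample;
  sample.reserve(n_draws);
  for (std::size_t i = 0; i < n_draws; ++i) {
    sample.push_back(dist(rng));
  }
  return sample;
}
```

Calling `weighted_bootstrap({1.0, 3.0, 0.0}, 1000, 42)` should return 1000 indices, mostly index 1 and never index 2. How fast the draws are depends on the standard library implementation, which is the point of this issue.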

@khotilov

Using boost::random::discrete_distribution as a replacement helps:
before:

> system.time(fit.1 <- ranger(y ~ x)) 
   user  system elapsed 
   9.27    0.13    9.41 
> system.time(fit.3 <- ranger(y ~ x, case.weights = rep(1, times = n))) 
   user  system elapsed 
  93.02    0.07   93.19 

after:

> system.time(fit.1 <- ranger(y ~ x)) 
   user  system elapsed 
   8.76    0.16    8.96 
> system.time(fit.3 <- ranger(y ~ x, case.weights = rep(1, times = n))) 
   user  system elapsed 
   8.98    0.09    9.09 

@mnwright
Member

Thanks! However, I'm reluctant to merge it into master because of the Boost dependency... ;)
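
One dependency-free alternative, not proposed in this thread and not benchmarked on the affected gcc 4.6.3 toolchain, would be inverse-CDF sampling: accumulate the weights once, then find each draw with a binary search. The sketch below assumes non-negative weights with a positive sum. The name `sample_by_cumsum` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: weighted sampling with replacement using cumulative
// weights and binary search. Assumes weights are non-negative and sum > 0.
std::vector<std::size_t> sample_by_cumsum(const std::vector<double>& weights,
                                          std::size_t n_draws,
                                          unsigned seed) {
  assert(!weights.empty());
  std::vector<double> cumsum(weights.size());
  std::partial_sum(weights.begin(), weights.end(), cumsum.begin());
  const double total = cumsum.back();
  assert(total > 0.0);

  std::mt19937_64 rng(seed);
  std::uniform_real_distribution<double> unif(0.0, total);  // u in [0, total)
  std::vector<std::size_t> sample;
  sample.reserve(n_draws);
  for (std::size_t i = 0; i < n_draws; ++i) {
    const double u = unif(rng);
    // First index whose cumulative weight exceeds u: O(log n) per draw.
    auto it = std::upper_bound(cumsum.begin(), cumsum.end(), u);
    if (it == cumsum.end()) {
      // Rounding can yield u == total; fall back to the last index that
      // carries positive weight.
      it = std::lower_bound(cumsum.begin(), cumsum.end(), total);
    }
    sample.push_back(static_cast<std::size_t>(it - cumsum.begin()));
  }
  return sample;
}
```

Because it still uses `std::uniform_real_distribution` from `<random>`, this would need timing on the old toolchain before anyone could say it avoids the slowdown.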

@khotilov

It is meant as a simple, temporary solution while waiting for a newer gcc. I didn't do extensive testing, but a quick check showed very similar model performance (see below). That at least makes it feasible for me to do some prototyping with ranger on my Windows laptop, as I frequently need to use weights. And the real dependency is only for the Windows R version, which is already a neglected child with no multithreading :)

# with the original std::discrete_distribution
set.seed(111)
fit_std <- ranger(y ~ x, case.weights = rep(1, times = n), write.forest = TRUE)
pr_std <- predict(fit_std, data.frame(x = x))

# with boost::random::discrete_distribution
set.seed(111)
fit_boost <- ranger(y ~ x, case.weights = rep(1, times = n), write.forest = TRUE)
pr_boost <- predict(fit_boost, data.frame(x = x))

# correlation between the predictions of the two builds
cor(pr_std$predictions, pr_boost$predictions)
[1] 0.9979446

gcc's <random> was based on Boost, but some over-engineering introduced overhead and made it slower; I've seen a few discussions about that in the past. It wasn't just discrete_distribution: some other distributions were several times slower too. Maybe things have improved significantly in the latest releases (I haven't really followed), but I personally have more trust in boost::random.

It's your choice in the end; I'm just telling you what I know. I'm glad I noticed this discussion: my initial observations didn't match the claims that ranger is very fast, so I hadn't even tried it on a Linux server.

@mnwright
Member

mnwright commented May 2, 2016

I just released a version (0.4.2) built with the new toolchain. As reported, the problem is solved there. In addition, multithreading is finally working! This version can also be installed on the current R version by using the binary; see https://github.com/imbs-hl/ranger/releases.

I hope it's solved with R-3.3.0!

@mayer79
Author

mayer79 commented May 2, 2016

This is brilliant, thank you very much for these investigations. Even on the current R version, the issue seems to be fixed with ranger 0.4.2. Wow!

@mayer79 closed this as completed May 2, 2016