
"Growing trees... Killed" extreme high memory consumption for survival forests #202

Closed
thomasmooon opened this issue Jun 6, 2017 · 11 comments



thomasmooon commented Jun 6, 2017

Hello,

thanks for providing the ranger package for fast RSF.
And thanks for your time reading this.

Depending on the value of num.trees, growing the trees aborts suddenly, even though I have
two strong servers as described below.
There is no error message, just "Killed" - please see the attached screenshot.

Given num.trees = 5000, the interruption occurred at a growing progress of e.g. 52%, 76%, or 86%,
but never at a lower progress rate.
On another dataset I've observed this behaviour at 99%, too.
I've tried using the "dependent.variable" and "status.variable" notation instead of providing
a survival formula or survival object, but that didn't help either.

I'm running the R script from the shell to avoid any overhead or perturbations from RStudio.

(screenshot: terminal output ending in "Killed")

The different training datasets I tried have 27,100 observations and 500 features.
The ranger() function call is:
`ranger(dependent.variable.name = "time", status.variable.name = "status", data = training, num.trees = num.trees, save.memory = TRUE)`

The call from the shell is, e.g., `R < run.Ranger.R --no-save`.

All independent variables are numeric and scaled to [0,1].
A workaround is reducing num.trees with a trial-and-error approach.
When using importance = TRUE I have to reduce the number of trees even further to avoid "Killed" sessions.
Setting save.memory = TRUE doesn't help.

I'd be glad and thankful for any ideas or proposals!

Finally, here are the hardware / software stats:

OS System

LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 6.8 (Santiago)
Release: 6.8
Codename: Santiago

Hardware, two servers each with the following (the problem occurs on both, so it's server-independent):

CPU: 20 Cores, Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
RAM: 246 GB

R

platform x86_64-redhat-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 3.2
year 2016
month 10
day 31
svn rev 71607
language R
version.string R version 3.3.2 (2016-10-31)
nickname Sincere Pumpkin Patch

thomasmooon changed the title (Jun 6, 2017): "Growing trees... Killed" depending on num.trees and data size → "Growing trees... Killed" depending on num.trees and data size although only 40% RAM usage
@thomasmooon

I'm doing some tests + memory monitoring and will provide them here as soon as possible.


thomasmooon commented Jun 7, 2017

I found out that the memory report above is misleading due to aggregation.
The real consumption is much higher, as I will show below.

I asked whether the crash is caused by the dimensions (27,100 x 500) or by the inner structure of the data. If it were the former, simulated data of the same size should cause a crash, too. Otherwise, the simulation should run successfully.

So I simulated a uniformly [0,1]-distributed matrix of the same dimensions, trained an SRF and logged memory consumption (code below). Then I did the same with the real data. In both cases the same parameters were used.

Result:
The RAM consumption for the real dataset goes through the roof, while for the simulation it is almost constant.
When the RAM consumption reaches the limit, the OS uses swap memory, then terminates the process when
that reaches its limit, too.

(plot: RAM consumption over time, simulated vs. real data, 27,100 x 500)

Questions:

Is this normal behaviour?
I'm also wondering why the "simulation data" computation takes about 27 min to finish (100% progress), which is about 10 min slower than the "real data" (95% progress). Why is it more expensive in computation, but cheaper in RAM?

Code (simulation data)

(Only the data argument was replaced when using the real data.)
```r
library(survival)
library(ranger)
library(dplyr)

n <- 27100
M <- data.frame(matrix(runif(n * 500, 0, 1), nrow = n))
M$time   <- round(rbeta(n, 10, 3) * 100, 0)
M$status <- round(runif(n, 0, 1) + 0.2, 0)

ntrees <- 5000
ncores <- 15

srf <-
  ranger(
    data = M,
    num.trees = ntrees,
    num.threads = ncores,
    dependent.variable.name = "time",
    status.variable.name = "status",
    seed = 1,
    save.memory = FALSE
  )
```

thomasmooon changed the title (Jun 7, 2017): "Growing trees... Killed" depending on num.trees and data size although only 40% RAM usage → "Growing trees... Killed" extrem high memory consumption
thomasmooon changed the title (Jun 7, 2017): "Growing trees... Killed" extrem high memory consumption → "Growing trees... Killed" extreme high memory consumption

mnwright commented Jun 8, 2017

Thanks. This is very strange. Any idea what could still be different in the real dataset? I guess you are not allowed to share it? Could you check the size of the resulting forest in the two cases, e.g., with
`mean(sapply(srf$forest$split.varIDs, length))`

By the way, save.memory = TRUE has no effect on survival forests, we should add a warning.
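
To see where that size ends up in memory, a rough per-component breakdown with base R's object.size() might also help (a sketch; which components are stored in srf$forest depends on the ranger version):

```r
# Rough per-component memory breakdown of a fitted forest object `srf`
# (loops over whatever is stored in srf$forest, since component names
# differ between ranger versions).
sizes_mb <- sort(sapply(srf$forest, function(x) as.numeric(object.size(x))),
                 decreasing = TRUE) / 1024^2
round(sizes_mb, 1)  # megabytes per stored component, largest first
```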


thomasmooon commented Jun 8, 2017

Main differences in the datasets (real vs. simulated):

  • The real dataset has logical and continuous features, i.e. columns with values in {0,1} or in [0,1], respectively.
  • The simulated dataset only has feature values in [0,1], not in {0,1}.

I'll check the tree sizes as soon as my servers work fine again. The following is somewhat off-topic, but:

Yesterday I ran some extensive run-and-kill tests, and since then I have been having trouble.
That means:

  1. I started ranger (with the real data) with different parameters (e.g. mtry, num.trees, save.memory, playing with the data dimensions) and logged the x in "Growing trees .... x% Progress" as well as the RAM demand.
    This gave me a table such as the one below for each parameter set.
  2. Since I didn't want to wait until completion each time, I killed the parent process with SIGTERM.
    I did this dozens of times.

This had a bad effect:
Normally the simulation script (code below) runs for up to ~25 min,
and the "Growing trees ..." progress message appears within 30 s of starting at the latest.
But now I have to wait 2 hours, and then ranger estimates a runtime of ~12-13 days.
Our IT department is quite busy trying to understand and fix that.

Example tables of progress and memory demand:

(screenshot: progress / memory demand tables)
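
For reference, this kind of RAM logging could be done from a second R session on Linux by polling /proc for the PID of the session running ranger (a minimal sketch; log_rss and its arguments are made up for illustration, not anything ranger provides):

```r
# Minimal sketch: poll the resident set size (VmRSS) of another process on Linux
# and append timestamped values to a CSV. `pid` is the PID of the R session
# running ranger; runs until interrupted.
log_rss <- function(pid, file = "rss_log.csv", interval = 10) {
  repeat {
    status <- readLines(sprintf("/proc/%s/status", pid))
    rss_kb <- gsub("[^0-9]", "", grep("^VmRSS", status, value = TRUE))
    cat(sprintf("%s,%s\n", format(Sys.time()), rss_kb), file = file, append = TRUE)
    Sys.sleep(interval)
  }
}
```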

Simulation script (5-10 minute version; ntrees <- 5000 takes ~25 min):

```r
library(ranger)

n <- 27144
nfeat <- 500
M <- data.frame(matrix(runif(n * nfeat, 0, 1), nrow = n))
M$time   <- round(rbeta(n, 10, 3) * 100, 0)
M$status <- round(runif(n, 0, 1) + 0.2, 0)

ntrees <- 1000
ncores <- 15

srf <-
  ranger(
    data = M,
    num.trees = ntrees,
    num.threads = ncores,
    dependent.variable.name = "time",
    status.variable.name = "status",
    seed = 1
  )
```

mnwright changed the title (Jun 12, 2017): "Growing trees... Killed" extreme high memory consumption → "Growing trees... Killed" extreme high memory consumption for survival forests

thomasmooon commented Jul 3, 2017

Please excuse my late response.

`mean(sapply(srf$forest$split.varIDs, length))` =: nSplit.mean comes out at about 12K in both cases (real and simulated data).
Increasing min.node.size regularizes the trees, so the problem becomes less expensive in terms of computing time, but it still needs massive RAM.

The picture below shows a benchmark over min.node.size:
the x-axis denotes min.node.size,
the y-axis denotes time (computation time in seconds) and nSplit.mean.

Nonetheless, the memory demand (not shown) for the real data remains massive.
The current workaround is increasing min.node.size so that an SRF with at least 1000 trees can be grown.
This is far short of the desired 5000 trees, but unfortunately there's no time to research how to reduce the RAM consumption further.

(plot: computation time and nSplit.mean as a function of min.node.size)
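
For completeness, a loop like the following could reproduce such a benchmark on the simulated data M from above (a sketch; num.trees = 200 and the min.node.size grid are only chosen to keep the runs short):

```r
# Sketch of the min.node.size benchmark: record runtime and mean number of
# nodes per tree for a grid of min.node.size values.
library(ranger)

res <- lapply(c(5, 20, 50, 100, 200), function(mns) {
  elapsed <- system.time(
    srf <- ranger(dependent.variable.name = "time",
                  status.variable.name = "status",
                  data = M, num.trees = 200,
                  min.node.size = mns, seed = 1)
  )["elapsed"]
  data.frame(min.node.size = mns,
             time_sec = unname(elapsed),
             nSplit.mean = mean(sapply(srf$forest$split.varIDs, length)))
})
do.call(rbind, res)
```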


mnwright commented Jul 6, 2017

The reason is probably that in a survival forest a cumulative hazard function (CHF) has to be saved in each terminal node. If there are many unique time points in the dataset, these CHFs grow large, and with many deep trees there are a lot of them.

To verify this, could you try to change the splitting rule to "extratrees" and/or "maxstat" and check if this changes the memory usage?
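
A back-of-envelope estimate illustrates how this adds up (all numbers below are assumptions taken from figures mentioned in this thread, not measured ranger internals):

```r
# Rough CHF storage estimate: each terminal node stores one double (8 bytes)
# per unique death time. All numbers are assumed, not measured.
num_trees          <- 5000
nodes_per_tree     <- 12000              # ~ mean(sapply(srf$forest$split.varIDs, length))
terminal_per_tree  <- nodes_per_tree / 2 # roughly half the nodes of a binary tree are leaves
unique_death_times <- 1000               # assumed number of unique event times
bytes <- num_trees * terminal_per_tree * unique_death_times * 8
bytes / 1024^3                           # ~224 GB, i.e. in the range of the 246 GB RAM limit
```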


thomasmooon commented Jul 7, 2017 via email


khotilov commented Jul 8, 2017

Approximating survival times on a restricted grid of time values can greatly improve the performance; 1000 time points is way too many. By the way, randomForestSRC has a parameter for facilitating that operation. I don't feel such a parameter is absolutely needed (I prefer full control over defining the time grid myself), but it might be a useful one to have.
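
For example, coarsening roughly daily times to months before fitting could look like this (a sketch against the training data from the first post; the assumption that time is in days and the 30-day bin width are illustrative):

```r
# Coarsen the time grid: map ~daily survival times to ~monthly bins before fitting,
# which drastically reduces the number of unique time points (and hence CHF length).
training$time_month <- ceiling(training$time / 30)
length(unique(training$time_month))  # far fewer unique time points than before

srf <- ranger(dependent.variable.name = "time_month",
              status.variable.name = "status",
              data = training[, setdiff(names(training), "time")],
              num.trees = 1000)
```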

@XavierPrudent

Dear all,
I landed on this page looking for explanations of why the memory usage of the ranger function skyrockets. Is there anything that can be done from the user side to avoid/minimize such high usage?
Thank you.
Regards,
Xavier


thomasmooon commented May 15, 2019

Hi @XavierPrudent, to point out a few things I did following @mnwright's recommendations:

  • aggregated the longitudinal representation, e.g. ~1000 days (3 years of history) -> 36 months
  • increased min.node.size, as in Marvin's comment above
  • reduced num.trees

More a workaround than a solution:

  • train multiple smaller RFs, whatever fits in RAM, then (a) use them as an ensemble or (b) combine them with caution (see the sketch below)
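
A minimal sketch of that ensemble idea, assuming the training data from the first post, a hypothetical newdata set to predict on, and that predict() on a ranger survival forest returns a chf matrix over the same unique.death.times for each sub-forest (which holds when they are grown on the same data):

```r
# Grow several smaller forests that each fit in RAM, then average their predicted CHFs.
library(ranger)

fits <- lapply(1:5, function(s)
  ranger(dependent.variable.name = "time", status.variable.name = "status",
         data = training, num.trees = 1000, seed = s))

chf_list <- lapply(fits, function(f) predict(f, data = newdata)$chf)
chf_ens  <- Reduce(`+`, chf_list) / length(chf_list)  # ensemble CHF, samples x death times
```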

Besides that, I also played around with other RF implementations. As far as I know, ranger is still the most efficient (survival) random forest implementation in R.

@mnwright

@XavierPrudent Please give some details (best with a reproducible example).
