
"Growing trees... Killed" extreme high memory consumption for survival forests #202

Closed
thomasmooon opened this issue Jun 6, 2017 · 11 comments



thomasmooon commented Jun 6, 2017

Hello,

thanks for providing the ranger package for fast RSF.
And thanks for your time reading this.

Depending on the value of num.trees, growing the trees aborts suddenly, even though I have
two strong servers as described below.
There is no error message, just "Killed" - please see the attached screenshot.

Given num.trees = 5000, the interruption occurred at a growing progress of e.g. 52%, 76%, or 86%,
but never at a lower progress rate.
On another dataset I've observed this behaviour at 99%, too.
I've tried using the "dependent.variable" and "status.variable" notation instead of providing
a survival formula or survival object, but that didn't help either.

I'm running the R script from the shell to avoid any overhead or perturbations from RStudio.

(screenshot: terminal output ending in "Killed")

The different training datasets I tried have 27,100 observations and 500 features.
The ranger() function call is:
`ranger(dependent.variable.name = "time", status.variable.name = "status", data = training, num.trees = num.trees, save.memory = TRUE)`

The call from the shell is, e.g., `R < run.Ranger.R --no-save`.

All independent variables are numeric and scaled to [0,1].
A workaround is reducing num.trees with a trial-and-error approach.
When using importance = TRUE I have to reduce the number of trees even further to avoid "Killed" sessions.
Setting save.memory = TRUE doesn't help.

I'd be glad and thankful for any ideas or proposals!

Finally, here are the hardware / software stats:

OS System

LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 6.8 (Santiago)
Release: 6.8
Codename: Santiago

Hardware, two servers each with the following (the problem occurs on both, so it's server-independent):

CPU: 20 Cores, Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
RAM: 246 GB

R

platform x86_64-redhat-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 3.2
year 2016
month 10
day 31
svn rev 71607
language R
version.string R version 3.3.2 (2016-10-31)
nickname Sincere Pumpkin Patch

thomasmooon changed the title (Jun 6, 2017): "Growing trees... Killed" depending on num.trees and data size → "Growing trees... Killed" depending on num.trees and data size although only 40% RAM usage
@thomasmooon

I'm doing some tests + memory monitoring and will provide them here as soon as possible.


thomasmooon commented Jun 7, 2017

I found out that the memory report above is misleading due to aggregation.
The real consumption is much higher, as I will show below.

I asked whether the crash is caused by the dimensions (27,100 x 500) or by the inner structure of the data. If it were the former, simulated data of the same size should cause a crash, too. Otherwise, the simulation should run successfully.

So I simulated a uniformly [0,1]-distributed matrix of the same dimensions, trained an SRF and logged memory consumption (code below). Then I did the same with the real data. In both cases the same parameters were used.

Result:
The RAM consumption for the real dataset goes through the roof, while for the simulation it is almost constant.
When the RAM consumption reaches the limit, the OS uses swap memory, then terminates the process when
that reaches its limit, too.

(plot: RAM consumption over time, simulated vs. real data, 27,100 x 500)

Questions:

Is this normal behaviour?
I'm also wondering why the "simulation data" computation takes about 27 min to finish (100% progress), which is about 10 min slower than the "real data" (95% progress). Why is it more expensive in computation, but cheaper in RAM?

Code (simulation data)

(Only the data argument was replaced when using the real data.)
```r
library(survival)
library(ranger)
library(dplyr)

n <- 27100
M <- data.frame(matrix(runif(n * 500, 0, 1), nrow = n))
M$time   <- round(rbeta(n, 10, 3) * 100, 0)
M$status <- round(runif(n, 0, 1) + 0.2, 0)

ntrees <- 5000
ncores <- 15

srf <-
  ranger(
    data = M,
    num.trees = ntrees,
    num.threads = ncores,
    dependent.variable.name = "time",
    status.variable.name = "status",
    seed = 1,
    save.memory = FALSE
  )
```

thomasmooon changed the title (Jun 7, 2017): "Growing trees... Killed" depending on num.trees and data size although only 40% RAM usage → "Growing trees... Killed" extrem high memory consumption
thomasmooon changed the title (Jun 7, 2017): "Growing trees... Killed" extrem high memory consumption → "Growing trees... Killed" extreme high memory consumption

mnwright commented Jun 8, 2017

Thanks. This is very strange. Any idea what could still be different in the real dataset? I guess you are not allowed to share it? Could you check the size of the resulting forest in the two cases, e.g., with
`mean(sapply(srf$forest$split.varIDs, length))`

By the way, save.memory = TRUE has no effect on survival forests, we should add a warning.
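
To see where that size ends up in memory, a rough per-component breakdown with base R's object.size() might also help (a sketch; which components are stored in srf$forest depends on the ranger version):

```r
# Rough per-component memory breakdown of a fitted forest object `srf`
# (loops over whatever is stored in srf$forest, since component names
# differ between ranger versions).
sizes_mb <- sort(sapply(srf$forest, function(x) as.numeric(object.size(x))),
                 decreasing = TRUE) / 1024^2
round(sizes_mb, 1)  # megabytes per stored component, largest first
```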


thomasmooon commented Jun 8, 2017

Main differences in the datasets (real vs. simulated):

  • The real dataset has logical and continuous features, i.e. columns with values in {0,1} or in [0,1], respectively.
  • The simulated dataset only has feature values in [0,1], not in {0,1}.

I'll check the tree sizes as soon as my servers work fine again. The following is somewhat off-topic, but:

Yesterday I ran some extensive run-and-kill tests, and since then I have been having trouble.
That means:

  1. I started ranger (with the real data) with different parameters (e.g. mtry, num.trees, save.memory, playing with the data dimensions) and logged the x in "Growing trees .... x% Progress" as well as the RAM demand.
    This gave me a table such as the one below for each parameter set.
  2. Since I didn't want to wait until completion each time, I killed the parent process with SIGTERM.
    I did this dozens of times.

This had a bad effect:
Normally the simulation script (code below) runs for up to ~25 min,
and the "Growing trees ..." progress message appears within 30 s of starting at the latest.
But now I have to wait 2 hours, and then ranger estimates a runtime of ~12-13 days.
Our IT department is quite busy trying to understand and fix that.

Example tables of progress and memory demand:

(screenshot: progress / memory demand tables)
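
For reference, this kind of RAM logging could be done from a second R session on Linux by polling /proc for the PID of the session running ranger (a minimal sketch; log_rss and its arguments are made up for illustration, not anything ranger provides):

```r
# Minimal sketch: poll the resident set size (VmRSS) of another process on Linux
# and append timestamped values to a CSV. `pid` is the PID of the R session
# running ranger; runs until interrupted.
log_rss <- function(pid, file = "rss_log.csv", interval = 10) {
  repeat {
    status <- readLines(sprintf("/proc/%s/status", pid))
    rss_kb <- gsub("[^0-9]", "", grep("^VmRSS", status, value = TRUE))
    cat(sprintf("%s,%s\n", format(Sys.time()), rss_kb), file = file, append = TRUE)
    Sys.sleep(interval)
  }
}
```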

Simulation script (5-10 minute version; ntrees <- 5000 takes ~25 min):

```r
library(ranger)

n <- 27144
nfeat <- 500
M <- data.frame(matrix(runif(n * nfeat, 0, 1), nrow = n))
M$time   <- round(rbeta(n, 10, 3) * 100, 0)
M$status <- round(runif(n, 0, 1) + 0.2, 0)

ntrees <- 1000
ncores <- 15

srf <-
  ranger(
    data = M,
    num.trees = ntrees,
    num.threads = ncores,
    dependent.variable.name = "time",
    status.variable.name = "status",
    seed = 1
  )
```

mnwright changed the title (Jun 12, 2017): "Growing trees... Killed" extreme high memory consumption → "Growing trees... Killed" extreme high memory consumption for survival forests

thomasmooon commented Jul 3, 2017

Please excuse my late response.

`mean(sapply(srf$forest$split.varIDs, length))` =: nSplit.mean comes out at about 12K in both cases (real and simulated data).
Increasing min.node.size regularizes the trees, so the problem becomes less expensive in terms of computing time, but it still needs massive RAM.

The picture below shows a benchmark over min.node.size:
the x-axis denotes min.node.size,
the y-axis denotes time (computation time in seconds) and nSplit.mean.

Nonetheless, the memory demand (not shown) for the real data remains massive.
The current workaround is increasing min.node.size so that an SRF with at least 1000 trees can be grown.
This is far short of the desired 5000 trees, but unfortunately there's no time to research how to reduce the RAM consumption further.

(plot: computation time and nSplit.mean as a function of min.node.size)
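
For completeness, a loop like the following could reproduce such a benchmark on the simulated data M from above (a sketch; num.trees = 200 and the min.node.size grid are only chosen to keep the runs short):

```r
# Sketch of the min.node.size benchmark: record runtime and mean number of
# nodes per tree for a grid of min.node.size values.
library(ranger)

res <- lapply(c(5, 20, 50, 100, 200), function(mns) {
  elapsed <- system.time(
    srf <- ranger(dependent.variable.name = "time",
                  status.variable.name = "status",
                  data = M, num.trees = 200,
                  min.node.size = mns, seed = 1)
  )["elapsed"]
  data.frame(min.node.size = mns,
             time_sec = unname(elapsed),
             nSplit.mean = mean(sapply(srf$forest$split.varIDs, length)))
})
do.call(rbind, res)
```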


mnwright commented Jul 6, 2017

The reason is probably that in a survival forest a cumulative hazard function (CHF) has to be saved in each terminal node. If there are many unique time points in the dataset, these CHFs grow large, and with many deep trees there are a lot of them.

To verify this, could you try to change the splitting rule to "extratrees" and/or "maxstat" and check if this changes the memory usage?
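
A back-of-envelope estimate illustrates how this adds up (all numbers below are assumptions taken from figures mentioned in this thread, not measured ranger internals):

```r
# Rough CHF storage estimate: each terminal node stores one double (8 bytes)
# per unique death time. All numbers are assumed, not measured.
num_trees          <- 5000
nodes_per_tree     <- 12000              # ~ mean(sapply(srf$forest$split.varIDs, length))
terminal_per_tree  <- nodes_per_tree / 2 # roughly half the nodes of a binary tree are leaves
unique_death_times <- 1000               # assumed number of unique event times
bytes <- num_trees * terminal_per_tree * unique_death_times * 8
bytes / 1024^3                           # ~224 GB, i.e. in the range of the 246 GB RAM limit
```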


thomasmooon commented Jul 7, 2017 via email


khotilov commented Jul 8, 2017

Approximating survival times on a restricted grid of time values can greatly improve the performance; 1000 time points is way too many. By the way, randomForestSRC has a parameter for facilitating that operation. I don't feel such a parameter is absolutely needed (I prefer full control over defining the time grid myself), but it might be a useful one to have.
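
For example, coarsening roughly daily times to months before fitting could look like this (a sketch against the training data from the first post; the assumption that time is in days and the 30-day bin width are illustrative):

```r
# Coarsen the time grid: map ~daily survival times to ~monthly bins before fitting,
# which drastically reduces the number of unique time points (and hence CHF length).
training$time_month <- ceiling(training$time / 30)
length(unique(training$time_month))  # far fewer unique time points than before

srf <- ranger(dependent.variable.name = "time_month",
              status.variable.name = "status",
              data = training[, setdiff(names(training), "time")],
              num.trees = 1000)
```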

@XavierPrudent

Dear all,
I landed on this page looking for explanations of why the memory usage of the ranger function skyrockets. Is there anything that can be done from the user side to avoid/minimize such high usage?
Thank you.
Regards,
Xavier


thomasmooon commented May 15, 2019

Hi @XavierPrudent, to point out a few things I did following @mnwright's recommendations:

  • aggregated the longitudinal representation, e.g. ~1000 days (3 years of history) -> 36 months
  • increased min.node.size, as in Marvin's comment above
  • reduced num.trees

More a workaround than a solution:

  • train multiple smaller RFs, whatever fits in RAM, then (a) use them as an ensemble or (b) combine them with caution (see the sketch below)
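
A minimal sketch of that ensemble idea, assuming the training data from the first post, a hypothetical newdata set to predict on, and that predict() on a ranger survival forest returns a chf matrix over the same unique.death.times for each sub-forest (which holds when they are grown on the same data):

```r
# Grow several smaller forests that each fit in RAM, then average their predicted CHFs.
library(ranger)

fits <- lapply(1:5, function(s)
  ranger(dependent.variable.name = "time", status.variable.name = "status",
         data = training, num.trees = 1000, seed = s))

chf_list <- lapply(fits, function(f) predict(f, data = newdata)$chf)
chf_ens  <- Reduce(`+`, chf_list) / length(chf_list)  # ensemble CHF, samples x death times
```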

Besides that, I also played around with other RF implementations. As far as I know, ranger is still the most efficient (survival) random forest implementation in R.

@mnwright

@XavierPrudent Please give some details (best with a reproducible example).
