-
Small update: not only in terms of computational performance, but also in terms of functional behaviour, ntime does not seem to make a difference when passed as a vector of discrete values. It appears to be parsed correctly in the R code, but the trees seem to be grown on all event times: I get exactly the same splits and exactly the same predictions no matter how I set ntime. I think I can get the behaviour I want (and a decent computational time) by manually rounding the event times with ceiling() and applying administrative censoring in the data, but I would appreciate a clarification on the expected behaviour of ntime. Thanks!
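For reference, the manual workaround described above could look roughly like this in base R (a sketch; the data frame `d` with columns `time` and `status`, and the cutoff `t_max = 7`, are illustrative assumptions, not the original data):

```r
# Hypothetical data: continuous event times and an event indicator
d <- data.frame(time = c(0.3, 2.7, 5.1, 9.4), status = c(1, 1, 0, 1))

t_max <- 7
d$time <- ceiling(d$time)        # discretize event times to an integer grid
d$status[d$time > t_max] <- 0    # events after t_max become censored
d$time <- pmin(d$time, t_max)    # administrative censoring at t_max
```

After this, every observation has a time in {1, ..., 7}, so the forest only ever sees 7 distinct event times.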
-
To speed up calculations, and since I am not interested in predictions at all event times in the dataset, I specify ntime for both survival and competing-risks models (ntime = 1:7).
I would expect the logrank split statistic to be calculated by summing over these 7 time points (or nearby proxies from the observed event times).
I would expect this to reduce the runtime to what I see when the dataset itself contains discrete event times (1, 2, 3, 4, 5, 6, 7).
There is, however, a big runtime difference between the two situations, and it is larger for competing risks.
As I am working with large datasets and running many models, working only with ntime (without rounding the event times with ceiling) is not feasible for me.
Of course, I might have the wrong expectations about what ntime does; my expectation that it would run as fast as with discrete times might be wrong.
I would appreciate an explanation.
See below for an example, with the runtimes on my Windows machine in comments.
Thank you!
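For readers following along, the two setups being compared look roughly like this (a sketch only: fitting with rfsrc() from randomForestSRC is assumed from context, and the data, formula, and sample size are illustrative, not the original benchmark; no timings are reproduced here):

```r
library(randomForestSRC)
library(survival)

# Illustrative simulated data with continuous event times
set.seed(1)
n <- 5000
d <- data.frame(
  time   = rexp(n, rate = 0.3),
  status = rbinom(n, 1, 0.8),
  x1 = rnorm(n),
  x2 = rnorm(n)
)

# (a) continuous event times; ensemble restricted to a 7-point grid via ntime
fit_ntime <- rfsrc(Surv(time, status) ~ ., data = d, ntime = 1:7)

# (b) event times discretized in the data itself: ceiling() plus
#     administrative censoring at t = 7, and no ntime argument
d_disc <- d
d_disc$time <- ceiling(d_disc$time)
d_disc$status[d_disc$time > 7] <- 0
d_disc$time <- pmin(d_disc$time, 7)
fit_disc <- rfsrc(Surv(time, status) ~ ., data = d_disc)
```

The question is why (a) is much slower than (b), given that both are expected to evaluate the model on only 7 time points.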