# Fast Empirical Survival Function Estimates

A notebook demonstrating how to make survival function estimates faster, based on the wickedly fast performance of Ben Gorman's `empirical_cdf()` function in his `mltools` package. 

Basically, I use the same source code as his function but reverse the coordinate order of the grid we evaluate on, since evaluating the survival function from the bottom left is just like evaluating the CDF from the top right. It is much faster than what I was doing before, as demonstrated here.
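To make the symmetry concrete, here is the identity being relied on, written for the bivariate case. The notation ($\hat S$, $X_{i1}$, $X_{i2}$) and the choice of strict inequalities are assumptions made for illustration; for continuous data, ties have probability zero, so the choice does not affect the estimate.

$$
\hat S(x_1, x_2) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{X_{i1} > x_1,\ X_{i2} > x_2\} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{-X_{i1} < -x_1,\ -X_{i2} < -x_2\}
$$

The right-hand side has the form of an empirical CDF of the negated data evaluated at the negated point, which is why the CDF machinery carries over once the direction of accumulation is flipped.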

In [1]:
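library(data.table)
library(mltools)
library(mvtnorm)  # provides rmvt(), used below; utils.R may already attach it
# project helpers; presumably where empSurv() and fastEmpSurv() are defined
source('~/isolines_uq/scripts/R/auxiliary_scripts/utils.R')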

“package ‘data.table’ was built under R version 4.2.3”
“package ‘mltools’ was built under R version 4.2.3”
“package ‘dplyr’ was built under R version 4.2.3”

Attaching package: ‘dplyr’


The following objects are masked from ‘package:data.table’:

    between, first, last


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


“package ‘purrr’ was built under R version 4.2.3”

Attaching package: ‘purrr’


The following object is masked from ‘package:data.table’:

    transpose




In [21]:
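# evaluation grid: gticks x gticks points on [lb, ub]^2
lb <- -2
ub <- 7
gticks <- 400
grid <- expand.grid(X1 = seq(lb, ub, length.out=gticks), X2 = seq(lb, ub, length.out=gticks))

# n draws from a bivariate t distribution (df = 4, scale matrix with off-diagonal 0.7)
n <- 10000
dat <- data.frame(rmvt(n, sigma = matrix(c(1, 0.7, 0.7, 1), nrow = 2), df = 4))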

In [22]:
start <- proc.time()
old_method <- apply(grid, 1, empSurv, dat=dat)
proc.time() - start

   user  system elapsed 
 21.316   0.212  21.539 
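For reference, a brute-force estimator of the kind timed above might look like the sketch below. `emp_surv_naive` is a hypothetical stand-in: the real `empSurv()` lives in `utils.R` and is not shown in this notebook, so its exact definition (including strict versus non-strict inequalities) is an assumption. The toy data frame is made up for illustration.

```r
# Hypothetical stand-in for empSurv(); the real helper is defined in utils.R
emp_surv_naive <- function(pt, dat) {
  # proportion of rows that exceed pt in both coordinates
  mean(dat[[1]] > pt[[1]] & dat[[2]] > pt[[2]])
}

# made-up toy data: four points on the diagonal
toy <- data.frame(X1 = c(0, 1, 2, 3), X2 = c(0, 1, 2, 3))
toy_surv <- emp_surv_naive(c(0.5, 0.5), toy)  # 3 of the 4 points lie above (0.5, 0.5)
toy_surv
```

Calling a function like this once per grid point through `apply()` rescans the whole data set every time, so the work grows with the number of grid points times the number of observations. That is what makes the loop slow.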

In [23]:
start <- proc.time()
new_method <- fastEmpSurv(grid, dat)
proc.time() - start

   user  system elapsed 
  0.196   0.000   0.110 

In [32]:
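start <- proc.time()
grid_table <- data.table(grid)
dat_table <- data.table(dat)
# note: only the grid is negated here, so this evaluates the CDF at -grid;
# a survival estimate by negation would also need the data negated
ecdf <- empirical_cdf(dat_table, -grid_table)$CDF
proc.time() - start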

   user  system elapsed 
  0.473   0.000   0.329 

In [33]:
all(old_method == new_method)

In [34]:
old_method

In [35]:
grid

X1,X2
<dbl>,<dbl>
-2.000000,-2
-1.977444,-2
-1.954887,-2
-1.932331,-2
-1.909774,-2
-1.887218,-2
-1.864662,-2
-1.842105,-2
-1.819549,-2
-1.796992,-2


In [47]:
empirical_cdf(dat_table, grid_table)

X1,X2,N.cum,CDF
<dbl>,<dbl>,<int>,<dbl>
-2.000000,-2,295,0.0295
-1.977444,-2,304,0.0304
-1.954887,-2,307,0.0307
-1.932331,-2,313,0.0313
-1.909774,-2,318,0.0318
-1.887218,-2,323,0.0323
-1.864662,-2,330,0.0330
-1.842105,-2,333,0.0333
-1.819549,-2,339,0.0339
-1.796992,-2,340,0.0340


In [48]:
mean(dat$X1 <= 7 & dat$X2 <= 7)

It recovers exactly the same values as the old method, and does so about 200 times faster.
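When comparing two vectors of floating-point estimates, a tolerance-based check is a useful companion to `==`. The sketch below uses made-up numbers, not the notebook's results, to show how the two checks can disagree.

```r
# made-up values: the first entries differ only by floating-point rounding
a <- c(0.1 + 0.2, 0.5)
b <- c(0.3, 0.5)

exact_match <- all(a == b)             # strict elementwise equality
close_match <- isTRUE(all.equal(a, b)) # equality up to a small tolerance
c(exact = exact_match, close = close_match)
```

If both methods compute the same integer counts and divide by the same `n`, strict equality should hold as well; `all.equal()` is simply the safer default when the arithmetic paths differ.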

## Playing around with the various steps

In [65]:
dat <- data.table(dat)
# flip the grid points around
grid <- data.table(grid[nrow(grid):1,])

In [66]:
grid

X1,X2
<dbl>,<dbl>
5.000000,5
4.979920,5
4.959839,5
4.939759,5
4.919679,5
4.899598,5
4.879518,5
4.859438,5
4.839357,5
4.819277,5


In [47]:
dat_copy <- copy(dat[, names(grid), with=FALSE])

In [48]:
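# unique X1 grid ticks, duplicated into a 'Bound.X1' column
uboundDT <- unique(data.table(grid[['X1']], grid[['X1']]))
setnames(uboundDT, c('X1', 'Bound.X1'))
# rolling join: snap each observation down to the largest tick <= its X1 value
dat_copy <- uboundDT[dat_copy, on='X1', roll=Inf, nomatch=0]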

In [49]:
# same again for X2
uboundDT <- unique(data.table(grid[['X2']], grid[['X2']]))
setnames(uboundDT, c('X2', 'Bound.X2'))
dat_copy <- uboundDT[dat_copy, on='X2', roll=Inf, nomatch=0]
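The rolling join is the step that does the binning, so here is a minimal sketch of it on toy values. The tick and observation values are made up; only the `data.table` join arguments mirror the cells above.

```r
library(data.table)

# made-up grid ticks and observations
bounds <- data.table(X1 = c(0, 1, 2), Bound.X1 = c(0, 1, 2))
obs <- data.table(X1 = c(0.4, 1.7, 2.5, -0.3))

# roll = Inf carries each tick forward, so every observation picks up the
# largest tick that is <= its value; nomatch = 0 drops the observation at
# -0.3, which lies below every tick
binned <- bounds[obs, on = 'X1', roll = Inf, nomatch = 0]
binned
```

The dropped observation is not lost from the estimate: it sits below every grid point, so it would never be counted in any survival probability, and the final division still uses the full sample size.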

In [50]:
dat_copy

X2,Bound.X2,X1,Bound.X1
<dbl>,<dbl>,<dbl>,<dbl>
0.5680313,0.5622490,0.35909359,0.34136546
0.1947867,0.1807229,0.27291956,0.26104418
0.5806829,0.5622490,1.05803816,1.04417671
1.0849059,1.0843373,2.18433435,2.16867470
0.4719644,0.4618474,0.80956416,0.80321285
3.6341887,3.6144578,2.74495695,2.73092369
2.3033774,2.2891566,2.85367524,2.85140562
0.9261613,0.9236948,0.57335317,0.56224900
1.2382369,1.2248996,0.31351578,0.30120482
0.5416225,0.5220884,0.72156044,0.70281124


In [51]:
# count the observations that landed in each occupied (Bound.X1, Bound.X2) cell
binned.uniques <- dat_copy[, .N, keyby=eval(paste0("Bound.", names(grid)))]

In [52]:
binned.uniques

Bound.X1,Bound.X2,N
<dbl>,<dbl>,<int>
0,0.00000000,2
0,0.02008032,1
0,0.04016064,3
0,0.06024096,2
0,0.08032129,1
0,0.14056225,1
0,0.16064257,3
0,0.18072289,4
0,0.20080321,3
0,0.22088353,2


In [53]:
setnames(binned.uniques, paste0("Bound.", names(grid)), names(grid))

In [54]:
binned.uniques

X1,X2,N
<dbl>,<dbl>,<int>
0,0.00000000,2
0,0.02008032,1
0,0.04016064,3
0,0.06024096,2
0,0.08032129,1
0,0.14056225,1
0,0.16064257,3
0,0.18072289,4
0,0.20080321,3
0,0.22088353,2


In [55]:
# attach the counts to every grid point; cells with no observations get 0
grid <- binned.uniques[grid, on=names(grid)]
grid[is.na(N), N := 0]

In [56]:
grid

X1,X2,N
<dbl>,<dbl>,<int>
5.000000,5,27
4.979920,5,0
4.959839,5,0
4.939759,5,2
4.919679,5,0
4.899598,5,0
4.879518,5,0
4.859438,5,0
4.839357,5,0
4.819277,5,0


In [57]:
# first pass: accumulate counts along X1 (descending) within each X2 value
grid[, N.cum := cumsum(N), by='X2']

In [59]:
# second pass: accumulate the running totals along X2 (descending) within each X1 value
grid[, N.cum := cumsum(N.cum), by='X1']
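The two grouped `cumsum()` passes together produce a two-dimensional cumulative count that starts from the top-right corner of the grid. A small sketch on a 2 x 2 grid with made-up bin counts shows the effect:

```r
library(data.table)

# 2 x 2 toy grid listed from the top-right corner down, with made-up counts
g <- data.table(X1 = c(1, 0, 1, 0), X2 = c(1, 1, 0, 0), N = c(1L, 2L, 3L, 4L))

g[, N.cum := cumsum(N), by = 'X2']      # accumulate along X1 within each X2
g[, N.cum := cumsum(N.cum), by = 'X1']  # then accumulate along X2 within each X1
g
```

After both passes, `N.cum` at a grid point is the total count over all cells at or above that point in both coordinates. For example, the corner `(0, 0)` collects all 1 + 2 + 3 + 4 = 10 observations. This only works because the rows are ordered from the largest ticks to the smallest, which is the reason for flipping the grid at the start of this section.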

In [60]:
grid

X1,X2,N,N.cum
<dbl>,<dbl>,<int>,<int>
5.000000,5,27,27
4.979920,5,0,27
4.959839,5,0,27
4.939759,5,2,29
4.919679,5,0,29
4.899598,5,0,29
4.879518,5,0,29
4.859438,5,0,29
4.839357,5,0,29
4.819277,5,0,29


In [61]:
# drop the per-cell counts and convert cumulative counts to survival probabilities
grid[, `:=`(N = NULL, Surv = N.cum/nrow(dat))]
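The steps above can be collected into one function. The version below is a sketch under stated assumptions, not the `fastEmpSurv()` from `utils.R` (which is not shown here): the name `fast_emp_surv_sketch` is hypothetical, the grid is assumed to be a full Cartesian product with at least two columns, and the rolling join means points equal to a tick are counted (a `>=` convention).

```r
library(data.table)

# Hypothetical helper assembled from the walkthrough; not the utils.R version
fast_emp_surv_sketch <- function(grid, dat) {
  vars <- names(grid)
  g <- copy(as.data.table(grid))
  g[, idx := .I]                            # remember the caller's row order
  d <- as.data.table(dat)[, vars, with = FALSE]
  n <- nrow(d)

  # snap each observation down to the largest tick <= its value, per coordinate
  for (v in vars) {
    b <- unique(data.table(g[[v]], g[[v]]))
    setnames(b, c(v, paste0('Bound.', v)))
    d <- b[d, on = v, roll = Inf, nomatch = 0]
  }

  # count observations per occupied cell and attach the counts to the grid
  bvars <- paste0('Bound.', vars)
  counts <- d[, .N, keyby = bvars]
  setnames(counts, bvars, vars)
  g <- counts[g, on = vars]
  g[is.na(N), N := 0L]

  # order from the top-right corner down, then one cumulative pass per coordinate
  setorderv(g, rev(vars), order = -1L)
  g[, N.cum := N]
  for (v in vars) {
    g[, N.cum := cumsum(N.cum), by = c(setdiff(vars, v))]
  }

  setorder(g, idx)                          # back to the caller's row order
  g$N.cum / n
}

# small made-up example, checked against a brute-force count
set.seed(1)
toy_dat <- data.frame(X1 = rnorm(200), X2 = rnorm(200))
toy_grid <- expand.grid(X1 = seq(-2, 2, length.out = 9),
                        X2 = seq(-2, 2, length.out = 9))
fast <- fast_emp_surv_sketch(toy_grid, toy_dat)
brute <- apply(toy_grid, 1, function(p) mean(toy_dat$X1 >= p[1] & toy_dat$X2 >= p[2]))
isTRUE(all.equal(fast, unname(brute)))
```

Sorting inside the function, rather than reversing the row order as in the walkthrough, avoids relying on the row order that `expand.grid()` happens to produce. The index column restores the original order so the result lines up with the input grid.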