/
neuralnetworks.qmd
294 lines (235 loc) · 37 KB
/
neuralnetworks.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
---
abstract: TODO (150-200 WORDS)
---
{{< include _setup.qmd >}}
# Neural Networks
{{< include _wip.qmd >}}
## Neural Networks {#sec-surv-ml-models-nn}
Before starting the survey on neural networks, first a comment about their transparency and accessibility. Neural networks are infamously difficult to interpret and train, with some calling building and training neural networks an 'art' [@Hastie2001]. As discussed in the introduction of this book, whilst neural networks are not transparent with respect to their predictions, they are transparent with respect to implementation. In fact the simplest form of neural network, as seen below, is no more complex than a simple linear model. With regard to accessibility, whilst it is true that defining a custom neural network architecture is complex and highly subjective, established models are implemented with a default architecture and are therefore accessible 'off-shelf'.
### Neural Networks for Regression
(Artificial) Neural networks (ANNs) are a class of model that fall within the greater paradigm of *deep learning*. The simplest form of ANN, a feed-forward single-hidden-layer network, is a relatively simple algorithm that relies on linear models, basic activation functions, and simple derivatives. A short introduction to feed-forward regression ANNs is provided to motivate the survival models. This focuses on single-hidden-layer models and increasing this to multiple hidden layers follows relatively simply.
The single hidden-layer network is defined through three equations
\begin{align}
& Z_m = \sigma(\alpha_{0m} + \alpha^T_mX_i), \quad m = 1,...,M \\
& T = \beta_{0k} + \beta_k^TZ, \quad k = 1,..,K \\
& g_k(X_i) = \phi_k(T)
\end{align}
where $(X_1,...,X_n) \iid X$ are the usual training data, $\alpha_{0m}, \beta_0$ are bias parameters, and $\theta = \{\alpha_m, \beta\}$ ($m = 1,..,,M$) are model weights where $M$ is the number of hidden units. $K$ is the number of classes in the output, which for regression is usually $K = 1$. The function $\phi$ is a 'link' or 'activation function', which transforms the predictions in order to provide an outcome of the correct return type; usually in regression, $\phi(x) = x$. $\sigma$ is the 'activation function', which transforms outputs from each layer. The $\alpha_m$ parameters are often referred to as 'activations'. Different activation functions may be used in each layer or the same used throughout, the choice is down to expert knowledge. Common activation functions seen in this section include the sigmoid function,
$$
\sigma(v) = (1 + \exp(-v))^{-1}
$$
tanh function,
$$
\sigma(v) = \frac{\exp(v) - \exp(-v)}{\exp(v) + \exp(-v)}
$$ {#eq-surv-tanh}
and ReLU [@Nair2010]
$$
\sigma(v) = \max(0, v)
$$ {#eq-surv-relu}
A single-hidden-layer model can also be expressed in a single equation, which highlights the relative simplicity of what may appear a complex algorithm.
$$
g_k(X_i) = \sigma_0(\beta_{k0} + \sum_{h=1}^H (\beta_{kh}\sigma_h (\beta_{h0} + \sum^M_{m=1} \beta_{hm}X_{i;m}))
$$ {#eq-surv-nnet}
where $H$ are the number of hidden units, $\beta$ are the model weights, $\sigma_h$ is the activation function in unit $h$, also $\sigma_0$ is the output unit activation, and $X_{i;m}$ is the $i$th observation features in the $m$th hidden unit.
An example feed-forward single-hidden-layer regression ANN is displayed in (@fig-surv-ann). This model has 10 input units, 13 hidden units, and one output unit; two bias parameters are fit. The model is described as 'feed-forward' as there are no cycles in the node and information is passed forward from the input nodes (left) to the output node (right).
![Single-hidden-layer artificial neural network with 13 hidden units fit on the `mtcars` [@datamtcars] dataset using the $\pkg{nnet}$ [@pkgnnet] package, and \pkg{gamlss.add} [@pkggamlssadd] for plotting. Left column are input variables, I1-I10, second column are 13 hidden units, H1-H13, right column is single output variable, O1. B1 and B2 are bias parameters.](Figures/neuralnetworks/ann.png){#fig-surv-ann fig-alt="TODO"}
#### Back-Propagation {.unnumbered .unlisted}
The model weights, $\theta$, in this section are commonly fit by 'back-propagation' although this method is often considered inefficient compared to more recent advances. A brief pseudo-algorithm for the process is provided below.
Let $L$ be a chosen loss function for model fitting, let $\theta = (\alpha, \beta)$ be model weights, and let $J \in \PNaturals$ be the number of iterations to train the model over. Then the back-propagation method is given by,
* **For** $j = 1,...,J$:
*[] *Forward Pass*
*[i.] Fix weights $\theta^{(j-1)}$.
*[ii.] Compute predictions $\hat{Y} := \hatg^{(j)}_k(X_i|\theta^{(j-1)})$ with (@eq-surv-nnet).
*[] *Backward Pass*
*[iii.] Calculate the gradients of the loss $L(\hatY|\dtrain)$.
*[] *Update*
*[iv.] Update $\alpha^{(r)}, \beta^{(r)}$ with gradient descent.
* **End For**
In regression, a common choice for $L$ is the squared loss,
$$
L(\hatg, \theta|\dtrain) = \sum_{i=1}^n (Y_i - \hatg(X_i|\theta))^2
$$
which may help illustrate how the training outcome, $(Y_1,...,Y_n) \iid Y$, is utilised for model fitting.
#### Making Predictions {.unnumbered .unlisted}
Once the model is fitted, predictions for new data follow by passing the testing data as inputs to the model with fitted weights,
$$
g_k(X^*) = \sigma_0(\hat{\beta}_{k0} + \sum_{h=1}^H (\hat{\beta}_{kh}\sigma_h (\hat{\beta}_{h0} + \sum^M_{m=1} \hat{\beta}_{hm}X^*_m))
$$
#### Hyper-Parameters {.unnumbered .unlisted}
In practice, a regularization parameter, $\lambda$, is usually added to the loss function in order to help avoid overfitting. This parameter has the effect of shrinking model weights towards zero and hence in the context of ANNs regularization is usually referred to as 'weight decay'. The value of $\lambda$ is one of three important hyper-parameters in all ANNs, the other two are: the range of values to simulate initial weights from, and the number of hidden units, $M$.
The range of values for initial weights is usually not tuned but instead a consistent range is specified and the neural network is trained multiple times to account for randomness in initialization.
The regularization parameter and number of hidden units, $M$, depend on each other and have a similar relationship to the learning rate and number of iterations in the GBMs (@sec-surv-ml-models-boost). Like the GBMs, it is simplest to set a high number of hidden units and then tune the regularization parameter [@Bishop2006; @Hastie2001]. Determining how many hidden layers to include, and how to connect them, is informed by expert knowledge and well beyond the scope of this book; decades of research has been required to derive sensible new configurations.
#### Training Batches {.unnumbered .unlisted}
ANNs can either be trained using complete data, in batches, or online. This decision is usually data-driven and will affect the maximum number of iterations used to train the algorithm; as such this will also often be chosen by expert-knowledge and not empirical methods such as cross-validation.
#### Neural Terminology {.unnumbered .unlisted}
Neural network terminology often reflects the structures of the brain. Therefore ANN units are referred to as nodes or neurons and sometimes the connections between neurons are referred to as synapses. Neurons are said to be 'fired' if they are 'activated'. The simplest example of activating a neuron is with the Heaviside activation function with a threshold of $0$: $\sigma(v) = \II(v \geq 0)$. Then a node is activated and passes its output to the next layer if its value is positive, otherwise it contributes no value to the next layer.
### Neural Networks for Survival Analysis
Surveying neural networks is a non-trivial task as there has been a long history in machine learning of publishing very specific data-driven neural networks with limited applications; this is also true in survival analysis. This does mean however that where limited developments for survival were made in other machine learning classes, ANN survival adaptations have been around for several decades. A review in 2000 by Schwarzer $\etal$ surveyed 43 ANNs for diagnosis and prognosis published in the first half of the 90s, however only up to ten of these are specifically for survival data.\footnote{Schwarzer conflates the prognosis and survival task, therefore it is not clear if all 10 of these are for time-to-event data (at least five definitely are).} Of those, Schwarzer $\etal$ deemed three to be 'na\"ive applications to survival data', and recommended for future research models developed by Liest\o l $\etal$ (1994) [@Liestol1994], Faraggi and Simon (1995) [@Faraggi1995], and Biganzoli $\etal$ (1998) [@Biganzoli1998].
This survey will not be as comprehensive as the 2000 survey, and nor has any survey since, although there have been several ANN reviews [@Ripley2001; @Huang2020a; @Ohno-Machado1996; @Yang2010;@Zhu2020]. ANNs are considered to be a black-box model, with interpretability decreasing steeply as the number of hidden layers and nodes increases. In terms of accessibility there have been relatively few open-source packages developed for survival ANNs; where these are available the focus has historically been in Python, with no $\Rstats$ implementations. The new $\pkg{survivalmodels}$ [@pkgsurvivalmodels] package,\footnote{Created in order to run the experiments in [@Sonabend2021b].} implements these Python models via $\pkg{reticulate}$ [@pkgreticulate]. No recurrent neural netwoks are included in this survey though the survival models SRN [@Oh2018] and RNN-Surv [@Giunchiglia2018] are acknowledged.
This survey is made slightly more difficult as neural networks are often proposed for many different tasks, which are not necessarily clearly advertised in a paper's title or abstract. For example, many papers claim to use neural networks for survival analysis and make comparisons to Cox models, whereas the task tends to be death at a particular (usually 5-year) time-point (classification) [@Han2018; @Lundin1999; @Ripley2001; @Ripley1998; @Seker2002], which is often not made clear until mid-way through the paper. Reviews and surveys have also conflated these different tasks, for example a very recent review concluded superior performance of ANNs over Cox models, when in fact this is only in classification [@Huang2020] (RM2) {sec:car_reduxstrats_mistakes}. To clarify, this form of classification task does fall into the general *field* of survival analysis, but not the survival *task* (@box-task-surv). Therefore this is not a comment on the classification task but a reason for omitting these models from this survey.
Using ANNs for feature selection (often in gene expression data) and computer vision is also very common in survival analysis, and indeed it is in this area that most success has been seen [@Bello2019; @Chen2014; @Cui2020; @Lao2017; @McKinney2020; @Rietschel2018; @Seker2002a; @Zhang2020; @Zhu2016], but these are again beyond the scope of this survey.
The key difference between neural networks is in their output layer, required data transformations, the model prediction, and the loss function used to fit the model. Therefore the following are discussed for each of the surveyed models: the loss function for training, $L$, the model prediction type, $\hatg$, and any required data transformation. Notation is continued from the previous surveys with the addition of $\theta$ denoting model weights (which will be different for each model).
#### Probabilistic Survival Models
Unlike other classes of machine learning models, the focus in ANNs has been on probabilistic models. The vast majority make these predictions via reduction to binary classification \ref{sec:car_reduxes_r7}. Whilst almost all of these networks implicitly reduce the problem to classification, most are not transparent in exactly how they do so and none provide clear or detailed interface points in implementation allowing for control over this reduction. Most importantly, the majority of these models do not detail how valid survival predictions are derived from the binary setting,\footnote{One could assume they use procedures such as those described in Tutz and Schmid (2016) [@Tutz2016] but there is rarely transparent writing to confirm this.} which is not just a theoretical problem as some implementations, such as the Logistic-Hazard model in $\pkg{pycox}$ [@pkgpycox], have been observed to make survival predictions outside the range $[0,1]$. This is not a statement about the performance of models in this section but a remark about the lack of transparency across all probabilistic ANNs.
Many of these algorithms use an approach that formulate the Cox PH as a non-linear model and minimise the partial likelihood. These are referred to as 'neural-Cox' models and the earliest appears to have been developed by Faraggi and Simon [@Faraggi1995]. All these models are technically composites that first predict a ranking, however they assume a PH form and in implementation they all appear to return a probabilistic prediction.
**ANN-COX** {#mod-anncox}\\
Faraggi and Simon [@Faraggi1995] proposed a non-linear PH model
$$
h(\tau|X_i,\theta) = h_0(\tau)\exp(\phi(X_i\beta))
$$ {#eq-surv-farsim}
where $\phi$ is the sigmoid function and $\theta = \{\beta\}$ are model weights. This model, 'ANN-COX', estimates the prediction functional, $\hatg(X^*) = \phi(X^*\hat{\beta})$. The model is trained with the partial-likelihood function
$$
L(\hatg, \theta|\dtrain) = \prod_{i = 1}^n \frac{\exp(\sum^M_{m=1} \alpha_m\hatg_m(X^*))}{\sum_{j \in \mathcal{R}_{t_i}} \exp(\sum^M_{m=1} \alpha_m\hatg_m(X^*))}
$$
where $\mathcal{R}_{t_i}$ is the risk group alive at $t_i$; $M$ is the number of hidden units; $\hatg_m(X^*) = (1 + \exp(-X^*\hat{\beta}_m))^{-1}$; and $\theta = \{\beta, \alpha\}$ are model weights.
The authors proposed a single hidden layer network, trained using back-propagation and weight optimisation with Newton-Raphson. This architecture did not outerperform a Cox PH [@Faraggi1995]. Further adjustments including (now standard) pre-processing and hyper-parameter tuning did not improve the model performance [@Mariani1997]. Further independent studies demonstrated worse performance than the Cox model [@Faraggi1995; @Xiang2000].
**COX-NNET** {#mod-coxnnet}\\
COX-NNET [@Ching2018a] updates the ANN-COX by instead maximising the regularized partial log-likelihood
$$
L(\hatg, \theta|\dtrain, \lambda) = \sum^n_{i=1} \Delta_i \Big[\hatg(X_i) \ - \ \log\Big(\sum_{j \in \calR_{t_i}} \exp(\hatg(X_j))\Big)\Big] + \lambda(\|\beta\|_2 + \|w\|_2)
$$
with weights $\theta = (\beta, w)$ and where $\hatg(X_i) = \sigma(wX_i + b)^T\beta$ for bias term $b$, and activation function $\sigma$; $\sigma$ is chosen to be the tanh function ((@eq-surv-tanh)). In addition to weight decay, dropout [@Srivastava2014] is employed to prevent overfitting. Dropout can be thought of as a similar concept to the variable selection in random forests, as each node is randomly deactivated with probability $p$, where $p$ is a hyper-parameter to be tuned.
Independent simulation studies suggest that COX-NNET does not outperform the Cox PH [@Gensheimer2019].
**DeepSurv** {#mod-deepsurv}\\
DeepSurv [@Katzman2018] extends these models to deep learning with multiple hidden layers. The chosen error function is the average negative log-partial-likelihood with weight decay
$$
L(\hatg, \theta|\dtrain, \lambda) = -\frac{1}{n^*} \sum_{i = 1}^n \Delta_i \Big[ \Big(\hatg(X_i) - \log \sum_{j \in \calR_{t_i})} \exp(\hatg(X_j)\Big)\Big] + \lambda\|\theta\|^2_2
$$
where $n^* := \sum^n_{i=1} \II(\Delta_i = 1)$ is the number of uncensored observations and $\hat{g}(X_i) = \phi(X_i|\theta)$ is the same prediction object as the ANN-COX. State-of-the-art methods are used for data pre-processing and model training. The model architecture uses a combination of fully-connected and dropout layers. Benchmark experiments by the authors indicate that DeepSurv can outperform the Cox PH in ranking tasks [@Katzman2016; @Katzman2018] although independent experiments do not confirm this [@Zhao2020].
\newpage
\noindent**Cox-Time** {#mod-coxtime}\\
Kvamme $\etal$ [@Kvamme2019a] build on these models by allowing time-varying effects. The loss function to minimise, with regularization, is given by
$$
L(\hatg, \theta|\dtrain, \lambda) = \frac{1}{n} \sum_{i:\Delta_i = 1} \log\Big(\sum_{j \in \calR_{t_i}} \exp[\hatg(X_j,T_i) - \hatg(X_i, T_i)]\Big) + \lambda \sum_{i:\Delta_i=1}\sum_{j \in \calR_{t_i}} |\hatg(X_j,T_i)|
$$
where $\hatg = \hatg_1,...,\hatg_n$ is the same non-linear predictor but with a time interaction and $\lambda$ is the regularization parameter. The model is trained with stochastic gradient descent and the risk set, $\calR_{t_i}$, in the equation above is instead reduced to batches, as opposed to the complete dataset. ReLU activations [@Nair2010] and dropout are employed in training. Benchmark experiments indicate good performance of Cox-Time, though no formal statistical comparisons are provided and hence no comment about general performance can be made.
**ANN-CDP** {#mod-anncdp}\\
One of the earliest ANNs that was noted by Schwarzer $\etal$ [@Schwarzer2000] was developed by Liest\o l $\etal$ [@Liestol1994] and predicts conditional death probabilities (hence 'ANN-CDP'). The model first partitions the continuous survival times into disjoint intervals $\calI_k, k = 1,...,m$ such that $\calI_k$ is the interval $(t_{k-1}, t_k]$. The model then studies the logistic Cox model (proportional odds) [@Cox1972] given by
$$
\frac{p_k(\xx)}{q_k(\xx)} = \exp(\eta + \theta_k)
$$
where $p_k = 1-q_k$, $\theta_k = \log(p_k(0)/q_k(0))$ for some baseline probability of survival, $q_k(0)$, to be estimated; $\eta$ is the usual linear predictor, and $q_k = P(T \geq T_k | T \geq T_{k-1})$ is the conditional survival probability at time $T_k$ given survival at time $T_{k-1}$ for $k = 1,...,K$ total time intervals. A logistic activation function is used to predict $\hatg(X^*) = \phi(\eta + \theta_k)$, which provides an estimate for $\hat{p}_k$.
The model is trained on discrete censoring indicators $D_{ki}$ such that $D_{ki} = 1$ if individual $i$ dies in interval $\calI_k$ and $0$ otherwise. Then with $K$ output nodes and maximum likelihood estimation to find the model parameters, $\hat{\eta}$, the final prediction provides an estimate for the conditional death probabilities $\hatp_k$. The negative log-likelihood to optimise is given by
$$
L(\hatg, \theta|\dtrain) = \sum^n_{i=1}\sum^{m_i}_{k=1} [D_{ki}\log(\hat{p}_k(X_i)) + (1-D_{ki})\log(\hat{q}_k(X_i))]
$$
where $m_i$ is the number of intervals in which observation $i$ is not censored.
Liest\o l $\etal${} discuss different weighting options and how they correspond to the PH assumption. In the most generalised case, a weight-decay type regularization is applied to the model weights given by
$$
\alpha \sum_l \sum_k (w_{kl} - w_{k-1,l})^2
$$
where $w$ are weights, and $\alpha$ is a hyper-parameter to be tuned, which can be used alongside standard weight decay. This corresponds to penalizing deviations from proportionality thus creating a model with approximate proportionality. The authors also suggest the possibility of fixing the weights to be equal in some nodes and different in others; equal weights strictly enforces the proportionality assumption. Their simulations found that removing the proportionality assumption completely, or strictly enforcing it, gave inferior results. Comparing their model to a standard Cox PH resulted in a 'better' negative log-likelihood, however this is not a precise evaluation metric and an independent simulation would be preferred. Finally List\o l $\etal$ included a warning `'The flexibility is, however, obtained at unquestionable costs: many parameters, difficult interpretation of the parameters and a slow numerical procedure'' [@Liestol1994].
**PLANN** {#mod-plann}\\
Biganzoli $\etal$ (1998) [@Biganzoli1998] studied the same proportional-odds model as the ANN-CDP [@Liestol1994]. Their model utilises partial logistic regression [@Efron1988] with added hidden nodes, hence 'PLANN'. Unlike ANN-CDP, PLANN predicts a smoothed hazard function by using smoothing splines. The continuous time outcome is again discretised into disjoint intervals $t_m, m = 1,...,M$. At each time-interval, $t_m$, the number of events, $d_m$, and number of subjects at risk, $n_m$, can be used to calculate the discrete hazard function,\footnote{Derivation of this as a 'hazard' estimator follows trivially by comparison to the Nelson-Aalen estimator.}
$$
\hat{h}_m = \frac{d_m}{n_m}, m = 1,...,M
$$ {#eq-surv-dischaz}
This quantity is used as the target to train the neural network. The survival function is then estimated by the Kaplan-Meier type estimator,
$$
\hatS(\tau) = \prod_{m:t_m \leq \tau} (1 - \hat{h}_m)
$$ {#eq-surv-discreteKM}
The model is fit by employing one of the more 'usual' survival reduction strategies in which an observation's survival time is treated as a covariate in the model [@Tutz2016]. As this model uses discrete time, the survival time is discretised into one of the $M$ intervals. This approach removes the proportional odds constraint as interaction effects between time and covariates can be modelled (as time-updated covariates). Again the model makes predictions at a given time $m$, $\phi(\theta_m + \eta)$, where $\eta$ is the usual linear predictor, $\theta$ is the baseline proportional odds hazard $\theta_m = \log(h_m(0)/(1-h_m(0))$. The logistic activation provides estimates for the discrete hazard,
$$
h_m(X_i) = \frac{\exp(\theta_m + \hat{\eta})}{1 + \exp(\theta_m + \hat{\eta})}
$$
which is smoothed with cubic splines [@Efron1988] that require tuning.
A cross-entropy error function is used for training
$$
L(\hath, \theta|\dtrain, a) = - \sum^M_{m = 1} \Big[\hat{h}_m \log \Big(\frac{h_l(X_i, a_l)}{\hat{h}_m}\Big) + (1 - \hat{h}_m) \log \Big(\frac{1 - h_l(X_i, a_l)}{1 - \hat{h}_m}\Big)\Big]n_m
$$
where $h_l(X_i, a_l)$ is the discrete hazard $h_l$ with smoothing at mid-points $a_l$. Weight decay can be applied and the authors suggest $\lambda \approx 0.01-0.1$ [@Biganzoli1998], though they make use of an AIC type criterion instead of cross-validation.
This model makes smoothed hazard predictions at a given time-point, $\tau$, by including $\tau$ in the input covariates $X_i$. Therefore the model first requires transformation of the input data by replicating all observations and replacing the single survival indicator $\Delta_i$, with a time-dependent indicator $D_{ik}$, the same approach as in ANN-CDP. Further developments have extended the PLANN to Bayesian modelling, and for competing risks [@Biganzoli2009].
No formal comparison is made to simpler model classes. The authors recommend ANNs primarily for exploration, feature selection, and understanding underlying patterns in the data [@Biganzoli2009].
**Nnet-survival** {#mod-nnetsurvival}\\
Aspects of the PLANN algorithm have been generalised into discrete-time survival algorithms in several papers [@Gensheimer2019; @Kvamme2019; @Mani1999; @Street1998]. Various estimates have been derived for transforming the input data to a discrete hazard or survival function. Though only one is considered here as it is the most modern and has a natural interpretation as the 'usual' Kaplan-Meier estimator for the survival function. Others by Street (1998) [@Street1998] and Mani (1999) [@Mani1999] are acknowledged. The discrete hazard estimator (@eq-surv-dischaz), $\hath$, is estimated and these values are used as the targets for the ANN. For the error function, the mean negative log-likelihood for discrete time [@Kvamme2019] is minimised to estimate $\hat{h}$,
$$
\begin{split}
L(\hath, \theta|\dtrain) = -\frac{1}{n} \sum^n_{i=1}\sum^{k(T_i)}_{j=1} (\II(T_i = \tau_j, \Delta_i = 1) \log[\hat{h}_i(\tau_j)] \ + \\
(1-\II(T_i = \tau_j, \Delta_i = 1))\log(1 - \hat{h}_i(\tau_j)))
\end{split}
$$
where $k(T_i)$ is the time-interval index in which observation $i$ dies/is censored, $\tau_j$ is the $j$th discrete time-interval, and the prediction of $\hat{h}$ is obtained via
$$
\hat{h}(\tau_j|\dtrain) = [1 + \exp(-\hat{g}_j(\dtrain))]^{-1}
$$
where $\hat{g}_j$ is the $j$th output for $j = 1,...,m$ discrete time intervals. The number of units in the output layer for these models corresponds to the number of discrete-time intervals. Deciding the width of the time-intervals is an additional hyper-parameter to consider.
Gensheimer and Narasimhan's 'Nnet-survival' [@Gensheimer2019] has two different implementations. The first assumes a PH form and predicts the linear predictor in the final layer, which can then be composed to a distribution. Their second 'flexible' approach instead predicts the log-odds of survival in each node, which are then converted to a conditional probability of survival, $1 - h_j$, in a given interval using the sigmoid activation function. The full survival function can be derived with (@eq-surv-discreteKM). The model has been demonstrated not to outperform the Cox PH with respect to Harrell's C or the Graf (Brier) score [@Gensheimer2019].
**PC-Hazard** {#mod-pchazard}\\
Kvamme and Borgan deviate from nnet-survival in their 'PC-Hazard' [@Kvamme2019] by first considering a discrete-time approach with a softmax activation function influenced by multi-class classification. They expand upon this by studying a piecewise constant hazard function in continuous time and defining the mean negative log-likelihood as
$$
L(\hatg, \theta|\dtrain) = -\frac{1}{n} \sum^n_{i=1} \Big(\Delta_i X_i\log\tilde{\eta}_{k(T_i)} - X_i\tilde{\eta}_{k(T_i)}\rho(T_i) - \sum^{k(T_i)-1}_{j=1} \tilde{\eta}_jX_i\Big)
$$
where $k(T_i)$ and $\tau_i$ is the same as defined above, $\rho(t) = \frac{t - \tau_{k(t)-1}}{\Delta\tau_{k(t)}}$, $\Delta\tau_j = \tau_j - \tau_{j-1}$, and $\tilde{\eta}_j := \log(1 + \exp(\hatg_j(X_i))$ where again $\hatg_j$ is the $j$th output for $j = 1,...,m$ discrete time intervals. Once the weights have been estimated, the predicted survival function is given by
$$
\hatS(\tau, X^*|\dtrain) = \exp(-X^*\tilde{\eta}_{k(\tau)}\rho(\tau)) \prod^{k(\tau)-1}_{j=1} \exp(-\tilde{\eta}_j(X^*))
$$
Benchmark experiments indicate similar performance to nnet-survival [@Kvamme2019], an unsurprising result given their implementations are identical with the exception of the loss function [@Kvamme2019], which is also similar for both models. A key result found that varying values for interval width lead to significant differences and therefore should be carefully tuned.
**DNNSurv** {#mod-dnnsurv}\\
A very recent (pre-print) approach [@Zhao2020] instead first computes 'pseudo-survival probabilities' and uses these to train a regression ANN with sigmoid activation and squared error loss. These pseudo-probabilities are computed using a jackknife-style estimator given by
$$
\tilde{S}_{ij}(T_{j+1}, \calR_{t_j}) = n_j\hatS(T_{j+1}|\calR_{t_j}) - (n_j - 1)\hatS^{-i}(T_{j+1}|\calR_{t_j})
$$
where $\hatS$ is the IPCW weighted Kaplan-Meier estimator (defined below) for risk set $\calR_{t_j}$, $\hatS^{-i}$ is the Kaplan-Meier estimator for all observations in $\calR_{t_j}$ excluding observation $i$, and $n_j := |\calR_{t_j}|$. The IPCW weighted Kaplan-Meier estimate is found via the IPCW Nelson-Aalen estimator,
$$
\hat{H}(\tau|\dtrain) = \sum^n_{i=1} \int^\tau_0 \frac{\II(T_i \leq u, \Delta_i = 1)\hatW_i(u)}{\sum^n_{j=1} \II(T_j \geq u) \hatW_j(u)} \ du
$$
where $\hatW_i,\hatW_j$ are subject specific IPC weights.
In their simulation studies, they found no improvement over other proposed neural networks. Arguably the most interesting outcome of their paper are comparisons of multiple survival ANNs at specific time-points, evaluated with C-index and Brier score. Their results indicate identical performance from all models. They also provide further evidence of neural networks not outperforming a Cox PH when the PH assumption is valid. However, in their non-PH dataset, DNNSurv appears to outperform the Cox model (no formal tests are provided). Data is replicated similarly to previous models except that no special indicator separates censoring and death, this is assumed to be handled by the IPCW pseudo probabilities.
<!-- %**RNN-SURV** {#mod-rnnsurv}\\
%\hl{consider deleting this model}
%RNN-SURV [@Giunchiglia2018] again uses a reduction to binary classification approach though makes use of recurrent layers to incorporate the sequential nature of survival data and time-variant features. The final layer in the model consists of $K$ nodes, which correspond with individual survival probability predictions for interval $k$, $\hatS(\tau_k|X)$, where $\tau_k, k = 1,...,K,$ are $K$ discrete time-intervals $\{(\tau_0, \tau_1],...,(\tau_{K-1}, \tau_K)]\}$. The model assumes constant hazards within each time interval. These estimates can be linearly combined to predict a single risk score for each observation, $\hat{r}_i = \sum^K_{k=1} \theta_k\hatS_i(\tau_k)$ where $\theta_k$ are the weights from the final model layer. In order to accommodate these predictions the data is again first transformed to consist of vectors of covariates with their corresponding time interval. The model uses a composite loss built from a linear combination of two individual losses with weight decay,
%$$
%L(\hatS, t, \delta|\theta, \alpha, \gamma) = \alpha L_1(\hatS, t, \delta|\theta) + \gamma L_2(\hatS, t,\delta|\theta) + \lambda\|\theta\|^2_2
%$$
%where $\alpha, \beta, \gamma$ are hyper-parameters to tune, $\theta$ are the model weights, and
%$$
%L_1(\hatS, t,\delta| \theta) = -\sum^K_{k=1}\sum_{i \in U_k} [\II(t_i > \tau_k)\log(\hatS_i(\tau_k)) + (1-\II(t_i > \tau_k))\log(1-\hatS_i(\tau_k)]
%$$ {#eq-surv-rnnsurv-cross}
%where $U_i = \{i : \delta_i = 1 \cup \delta_i = 0 \cap T_i > \tau_k\}$ is the set of observations who are either alive, dead, or not-yet-censored. $L_1$ can be recognised as a cross-entropy loss [@Graf1999]. $L_2$ is an upper bound of the negative C-index [@Steck2008],
%$$
%L_2(\theta) = -\frac{1}{|\calC|}\sum_{(i,j)\in\calC} \Big[1 + \Big(\frac{\log \ \sigma(\hat{r}_j-\hat{r}_i)}{\log \ 2}\Big)\Big]
%$$
%where $\calC = \{(i,j) : \delta_i = 1 \cap t_i \leq t_j\}$. The ReLU activation function is used for the feed-forward layers, long short-term memory (LSTM) [@Hochreiter1997] cells in the recurrent layers, and the sigmoid function in the final layer. Dropout is again employed. The model demonstrates good sepration ability with respect to Harrell's C and is shown to significantly outperform Cox PH, RSFs [@Ishwaran2008], and DeepSurv [@Katzman2018]. No experiments are performed to assess the predictive performance of the predicted survival distributions. No explicit methodology is provided for enforcing a decreasing monotonic prediction for the discrete survival distribution, which is a general problem that is discussed further in (@sec-car). No off-shelf implementation is available.
% -->
**DeepHit** {#mod-deephit}\\
DeepHit [@Lee2018a] was originally built to accommodate competing risks, but only the non-competing case is discussed here [@Kvamme2019a]. The model builds on previous approaches by discretising the continuous time outcome, and makes use of a composite loss. It has the advantage of making no parametric assumptions and directly predicts the probability of failure in each time-interval (which again correspond to different terminal nodes), i.e. $\hat{g}(\tau_k|\dtest) = \hat{P}(T^* = \tau_k|X^*)$ where again $\tau_k, k = 1,...,K$ are the distinct time intervals. The estimated survival function is found with $\hatS(\tau_K|X^*) = 1 - \sum^K_{k = 1} \hat{g}_i(\tau_k|X^*)$. ReLU activations were used in all fully connected layers and a softmax activation in the final layer. The losses in the composite error function are given by
$$
L_1(\hatg, \theta|\dtrain) = -\sum^N_{i=1} [\Delta_i \log(\hat{g}_i(T_i)) + (1-\Delta_i)\log(\hatS_i(T_i))]
$$
and
$$
L_2(\hatg, \theta|\dtrain, \sigma) = \sum_{i \neq j} \Delta_i \II(T_i < T_j) \sigma(\hatS_i(T_i), \hatS_j(T_i))
$$
for some convex loss function $\sigma$ and where $\hat{g}_i(t) = \hat{g}(t|X_i)$. Again these can be seen to be a cross-entropy loss and a ranking loss. Benchmark experiments demonstrate the model outperforming the Cox PH and RSFs [@Lee2018a] with respect to separation, and an independent experiment supports these findings [@Kvamme2019a]. However, the same independent study demonstrated worse performance than a Cox PH with respect to the integrated Brier score [@Graf1999].
#### Deterministic Survival Models
Whilst the vast majority of survival ANNs have focused on probabilistic predictions (often via ranking), a few have also tackled the deterministic or 'hybrid' problem.
**RankDeepSurv** {#mod-rankdeepsurv}\\
Jing $\etal$ [@Jing2019] observed the past two decades of research in survival ANNs and then published a completely novel solution, RankDeepSurv, which makes predictions for the survival time $\hat{T} = (\hat{T}_1,...,\hat{T}_n)$. They proposed a composite loss function
$$
L(\hatT, \theta|\dtrain, \alpha,\gamma,\lambda) = \alpha L_1(\hat{T},T,\Delta) + \gamma L_2(\hat{T},T,\Delta) + \lambda\|\theta\|^2_2
$$
where $\theta$ are the model weights, $\alpha,\gamma \in \PReals$, $\lambda$ is the shrinkage parameter, by a slight abuse of notation $T = (T_1,...,T_n)$ and $\Delta = (\Delta_1,...,\Delta_n)$, and
$$
L_1(\hatT, \theta|\dtrain) = \frac{1}{n} \sum_{\{i: I(i) = 1\}} (\hat{T}_i - T_i)^2;
\quad I(i) =
\begin{cases}
1, & \Delta_i = 1 \cup (\Delta_i = 0 \cap \hat{T}_i \leq T_i) \\
0, & \otherw
\end{cases}
$$
$$
L_2(\hatT, \theta|\dtrain) = \frac{1}{n}\sum^n_{\{i,j : I(i,j) = 1\}} [(T_j - T_i) - (\hat{T}_j - \hat{T}_i)]^2;
\quad
I(i,j) =
\begin{cases}
1, & T_j - T_i > \hat{T}_j - \hat{T}_i \\
0, & \otherw
\end{cases}
$$
where $\hat{T}_i$ is the predicted survival time for observation $i$. A clear contrast can be made between these loss functions and the constraints used in SSVM-Hybrid [@VanBelle2011b] (@sec-surv-ml-models-svm-surv). $L_1$ is the squared second constraint in \ref{eq:surv_ssvmvb2} and $L_2$ is the squared first constraint in \ref{eq:surv_ssvmvb2}. However $L_1$ in RankDeepSurv discards the squared error difference for all censored observations when the prediction is lower than the observed survival time; which is problematic as if someone is censored at time $T_i$ then it is guaranteed that their true survival time is greater than $T_i$ (this constraint may be more sensible if the inequality were reversed). An advantage to this loss is, like the SSVM-Hybrid, it enables a survival time interpretation for a ranking optimised model; however these 'survival times' should be interpreted with care.
The authors propose a model architecture with several fully connected layers with the ELU [@Clevert2015] activation function and a single dropout layer. Determining the success of this model is not straightforward. The authors claim superiority of RankDeepSurv over Cox PH, DeepSurv, and RSFs however this is an unclear comparison (RM2) {sec:car_reduxstrats_mistakes} that requires independent study.
### Conclusions
There have been many advances in neural networks for survival analysis. It is not possible to review all proposed survival neural networks without diverting too far from the book scope. This survey of ANNs should demonstrate two points: firstly that the vast majority (if not all) of survival ANNs are reduction models that either find a way around censoring via imputation or discretisation of time-intervals, or by focusing on partial likelihoods only; secondly that no survival ANN is fully accessible or transparent.
Despite ANNs being highly performant in other areas of supervised learning, there is strong evidence that the survival ANNs above are inferior to a Cox PH when the data follows the PH assumption or when variables are linearly related [@Gensheimer2018; @Luxhoj1997; @Ohno-Machado1997; @Puddu2012; @Xiang2000; @Yang2010; @Yasodhara2018; @Zhao2020]. There are not enough experiments to make conclusions in the case when the data is non-PH. Experiments in [@Sonabend2021b] support the finding that survival ANNs are not performant.
There is evidence that many papers introducing neural networks do not utilise proper methods of comparison or evaluation [@Kiraly2018d] and in conducting this survey, these findings are further supported. Many papers made claims of being 'superior' to the Cox model based on unfair comparisons (RM2){sec:car_reduxstrats_mistakes} or miscommunicating (or misinterpreting) results (e.g. [@Fotso2018]). At this stage, it does not seem possible to make any conclusions about the effectiveness of neural networks in survival analysis. Moreover, even the authors of these models have pointed out problems with transparency [@Biganzoli2009; @Liestol1994], which was further highlighted by Schwarzer $\etal$ [@Schwarzer2000].
Finally, accessibility of neural networks is also problematic. Many papers do not release their code and instead just state their networks architecture and available packages. In theory, this is enough to build the models however this does not guarantee the reproducibility that is usually expected. For users with a technical background and good coding ability, many of the models above could be implemented in one of the neural network packages in $\Rstats$, such as $\pkg{nnet}$ [@pkgnnet] and $\pkg{neuralnet}$ [@pkgneuralnet]; though in practice the only package that does contain these models, $\pkg{survivalmodels}$, does not directly implement the models in $\Rstats$ (which is much slower than Python) but provides a method for interfacing the Python implementations in $\pkg{pycox}$ [@pkgpycox].