book/forests.qmd

---
abstract: TODO (150-200 WORDS)
---

{{< include _setup.qmd >}}

# Tree-Based Methods

{{< include _wip.qmd >}}

## Random Forests {#sec-surv-ml-models-ranfor}

### Random Forests for Regression

Random forests are a composite algorithm built by fitting many simpler component models, decision trees, and then averaging the results of predictions from these trees. Decision trees are first briefly introduced before the key 'bagging' algorithm that composes these trees to a random forest. Woodland terminology is used throughout this subsection.

#### Decision Trees {.unnumbered .unlisted}

Decision trees are a common model class in machine learning and have the advantage of being (relatively) simple to implement and highly interpretable. A decision tree takes a set of inputs and a given *splitting rule* in order to create a series of splits, or branches, in the tree that culminates in a final *leaf*, or *terminal node*. Each terminal node has a corresponding prediction, which for regression is usually the sample mean of the training outcome data. This is made clearer by example, (@fig-surv-ranfor) demonstrates a decision tree predicting the miles per gallon (`mpg`) of a car from the `mtcars`  [@datamtcars] dataset. With this tree a new prediction is made by feeding the input variables from the top to the bottom, for example given new data, $x = \{`wt` = 3, `disp` = 250\}$, then in the first split the right branch is taken as `wt` $= 3 > 2.32$ and in the second split the left branch is taken as `disp` $= 250 \leq 258$, therefore the new data point 'lands' in the final leaf and is predicted to have an `mpg` of $20.8$. This value of $20.8$ arises as the sample mean of `mpg` for the $11$ (which can be seen in the box) observations in the training data who were sorted into this terminal node. Algorithmically, as splits are always binary, predictions are simply a series of conditional logical statements.

![Demonstrating classification trees using the `mtcars`  [@datamtcars] dataset and the $\pkg{party}$  [@pkgparty] package. Ovals are leaves, which indicate the variable that is being split. Edges are branches, which indicate the cut-off at which the variable is split. Rectangles are terminal nodes and include information about the number of training observations in the node and the terminal node prediction.](Figures/forests/iris_tree.png){#fig-surv-ranfor fig-alt="TODO"}

#### Splitting Rules {.unnumbered .unlisted}

Precisely how the splits are derived and which variables are utilised is determined by the splitting rule.\footnote{Other methods for growing trees such as pruning are not discussed here as they are less relevant to random forests, which are primarily of interest. Instead see (e.g.) Breiman (1984)  [@Breiman1984].} In regression, the most common splitting rule is to select the cut-off for a given variable that minimises the mean squared error in each hypothetical resultant leaf. The goal is to find the variable and cutoff that leads to the greatest difference between the two resultant leaves and thus the maximal homogeneity within each leaf. For all decision tree and random forest algorithms going forward, let $L$ denote some leaf, then let $L_{xy}, L_x, L_y$ respectively be the set of observations, features, and outcomes in leaf $L$. Let $L_{y;i}$ be the $i$th outcome in $L_y$ and finally let $L_{\bar{y}} = \mean{L_{y;i}}$. To simplify notation, $i \in L$ is taken to be equivalent to $i \in \{i: X_i \in L_X\}$, i.e. the indices of the observations in leaf $L$.

Let $c \in \Reals$ be some cutoff parameter and let $L^a_{xy}(j,c) := \{(X_i,Y_i)|X_{ij} < c, i = 1,...,n\}, L^b_{xy}(j,c) = \{(X_i,Y_i)|X_{ij} \geq c, i = 1,...,n\}$ be the two leaves containing the set of observations resulting from partitioning variable $j$ at cutoff $c$. Then a split is determined by finding the arguments, $(j^*,c^*)$ that minimise the sum of the mean squared errors (MSE) in both leaves  [@Hastie2013],
$$
(j^*, c^*) = \argmin_{j, c} \sum_{y \in L^a_{y}(j,c)} (y - L^a_{\bar{Y}}(j,c))^2 + \sum_{y \in L^b_{y}(j,c)} (y - L^b_{\bar{Y}}(j,c))^2
$$ {#eq-dt-min}
This method is repeated from the first branch of the tree down to the very last such that observations are included in a given leaf $L$ if they satisfy all conditions from all previous branches; features may be considered multiple times in the growing process. This is an intuitive method as minimising the above sum results in the set of observations within each individual leaf being as similar as possible, thus as an observation is passed down the tree, it becomes more similar to the subsequent leaves, eventually landing in a leaf containing homogeneous observations. Controlling how many variables to consider at each split and how many splits to make are determined by hyper-parameter tuning.

Decision trees are a powerful method for high-dimensional data as only a small sample of variables will be used for growing a tree, and therefore they are also useful for variable importance by identifying which variables were utilised in growth (other importance methods are also available). Decision trees are also highly interpretable, as demonstrated by (@fig-surv-ranfor). The recursive pseudo-algorithm in (@alg-dt-fit) demonstrates the simplicity in growing a decision tree (again methods such as pruning are omitted).

\begin{algorithm}
\caption{Fitting a decision tree. \\
**Input** Training data, $\dtrain$. Splitting rule, $SR$. \\
**Output** Fitted decision tree, $\hatg$.}
\begin{algorithmic}[1]
\State Compute $(j^*, c^*)$ as the optimisers of $SR$ (e.g. (@eq-dt-min)) to create the initial leaf and branches.
\State Repeat step 1 on all subsequent branches until a stopping rule is reached.
\State Return the fitted tree, $\hatg$, as the series of branches.
\end{algorithmic}
\end{algorithm}
<!-- {#alg-dt-fit} -->

#### Stopping Rules {.unnumbered .unlisted}

The 'stopping rule' in (@alg-dt-fit) is usually a condition on the number of observations in each leaf such that leaves will continue to be split until some minimum number of observations has been reached in a leaf. Other conditions may be on the 'depth' of the tree, which corresponds to the number of levels of splitting, for example the tree in (@fig-surv-ranfor) has a depth of $2$ (the first level is not counted).

#### Random Forests {.unnumbered .unlisted}

Despite being more interpretable than other machine learning methods, decision trees usually have poor predictive performance, high variance and are not robust to changes in the data. As such, *random forests* are preferred to improve prediction accuracy and decrease variance. Random forests utilise bootstrap aggregation, or *bagging*  [@Breiman1996a], to aggregate many decision trees. A pseudo fitting algorithm is given in  (@alg-rsf-fit).

\begin{algorithm}
\caption{Fitting a random forest. \\
**Input** Training data, $\dtrain$. Total number of trees, $B \in \PNaturals$. \\
**Output** Fitted random forest, $\hatg$.}
\begin{algorithmic}[1]
\For{$b = 1,...,B$}
\State Create a bootstrapped sample of the data, $D_b$.
\State Grow a decision tree, $\hatg_b$, on $D_b$ with (@alg-dt-fit).
\EndFor
\State $\hatg \gets \{\hatg_b\}^B_{b=1}$
\Return $\hatg$
\end{algorithmic}
\end{algorithm}
<!-- {#alg-rsf-fit} -->

Prediction from a random forest follows by making predictions from the individual trees and aggregating the results by some function $\sigma$ (@alg-rsf-pred); $\sigma$ is usually the sample mean for regression,

$$
\hatg(X^*) = \sigma(\hatg_1(X^*),...,\hatg_B(X^*)) = \frac{1}{B} \sum^B_{b=1} \hatg_b(X^*)
$$

where $\hatg_b(X^*)$ is the terminal node prediction from the $b$th tree and $B$ are the total number of grown trees (`$B$' is commonly used instead of `$N$' to note the relation to bootstrapped data).

\begin{algorithm}
\caption{Predicting from a random forest. \\
**Input** Testing data $X^* \sim \calX$, fitted forest $\hatg$ with $B \in \PNaturals$ trees, aggregation method $\sigma$. \\
**Output** Prediction, $\hatY \sim \calY$.}
\begin{algorithmic}[1]
\For{$b = 1,...,B$}
\State 'Drop' $X^*$ down the tree $\hatg_b$ individually to return a prediction $\hatg_b(X^*)$.
\EndFor
\State $\hatY \gets \sigma(\hatg_1(X^*),...,\hatg_B(X^*))$
\Return $\hatY$
\end{algorithmic}
\end{algorithm}
<!-- {#alg-rsf-pred} -->

Usually many (hundreds or thousands) trees are grown, which makes random forests robust to changes in data and 'confident' about individual predictions. Other advantages include having several tunable hyper-parameters, including: the number of trees to grow, the number of variables to include in a single tree, the splitting rule, and the minimum terminal node size. Machine learning models with many hyper-parameters, tend to perform better than other models as they can be fine-tuned to the data, which is why complex deep learning models are often the best performing. Although as a caveat: too many parameters can lead to over-fitting and tuning many parameters can take a long time and be highly intensive. Random forests lose the interpretability of decision trees and are considered 'black-box' models as individual predictions cannot be easily scrutinised.

### Random Forests for Survival Analysis

Given time constraints and the scope of this book, this survey of random forests for survival analysis will primarily focus on 'traditional' decision trees and random forests and will not look at other sub-fields such as causal forests. A comprehensive review of random survival forests (RSFs) is provided in Bou-Hamad (2011) [@Bou-Hamad2011], which includes extensions to time-varying covariates and different censoring types. In order to prevent overlap, this survey will focus primarily on methods that have off-shelf implementations, their prediction types, and how successfully these methods handle the problem of censoring. Random forests and decision trees for survival are termed from here as Random Survival Forests (RSFs) and Survival Decision Trees (SDTs) respectively.

Unlike other machine learning methods that may require complex changes to underlying algorithms, individual components of a random forest can be adapted without altering the fundamental algorithm. The principle random forest algorithm is unchanged for RSFs, the difference is in the choice of splitting rule and terminal node prediction, which both must be able to handle censoring. Therefore instead of discussing individual algorithms, the different choices of splitting rules and terminal node predictions are discussed, then combinations of these are summarised into five distinct algorithms.

#### Splitting Rules

Survival trees and RSFs have been studied for the past four decades and whilst the amount of splitting rules to appear could be considered `'numerous'' [@Bou-Hamad2011], only two broad classes are commonly utilised and implemented  [@pkgrfsrc; @pkgsksurvival; @pkgrpart; @pkgranger]. The first class rely on hypothesis tests, and primarily the log-rank test, to maximise dissimilarity between splits, the second class utilises likelihood-based measures. The first is discussed in more detail as this is common in practice and is relatively straightforward to implement and understand, moreover it has been demonstrated to outperform other splitting rules [@Bou-Hamad2011]. Likelihood rules are more complex and require assumptions that may not be realistic, these are discussed briefly.

#### Hypothesis Tests {.unnumbered .unlisted}

The log-rank test statistic has been widely utilised as the 'natural' splitting-rule for survival analysis  [@Ciampi1986; @Ishwaran2008; @LeBlanc1993; @Segal1988].  The log-rank test compares the survival distributions of two groups and has the null-hypothesis that both groups have the same underlying risk of (immediate) death, i.e. identical hazard functions.

Let $L^A$ and $L^B$ be two leaves then using the notation above let $h^A,h^B$ be the (true) hazard functions derived from the observations in the two leaves respectively. The log-rank hypothesis test is given by $H_0: h^A = h^B$ with test statistic  [@Segal1988],
$$
LR(L^A) = \frac{\sum_{\tau \in \calU_D} (d^A_{\tau} - e^A_{\tau})}{\sqrt{\sum_{\tau \in \calU_D} v_\tau^A}}
$$
where $d^A_{\tau}$ is the observed number of deaths in leaf $A$ at $\tau$,
$$
d^A_{\tau} := \sum_{i \in L^A} \II(T_i = \tau, \Delta_i = 1)
$$
$e^A_{\tau}$ is the expected number of deaths in leaf $A$ at $\tau$,
$$
e^A_{\tau} := \frac{n_\tau^A d_\tau}{n_\tau}
$$
and $v^A_\tau$ is the variance of the number of deaths in leaf $A$ at $\tau$,
$$
v^A_{\tau} := e^A_{\tau} \Big(\frac{n_\tau - d_\tau}{n_\tau}\Big)\Big(\frac{n_\tau - n^A_\tau}{n_\tau - 1}\Big)
$$
where $\calU_D$ is the set of unique death times across the data (in both leaves), \\ $n_\tau = \sum_i \II(T_i \geq \tau)$ is the number of observations at risk at $\tau$ in both leaves, \\ $n_\tau^A = \sum_{i \in L^A} \II(T_i \geq \tau)$ is the number of observations at risk at $\tau$ in leaf A, and \\ $d_\tau = \sum_i \II(T_i = \tau, \Delta_i = 1)$ is the number of deaths at $\tau$ in both leaves.

Intuitively these results follow as the number of deaths in a leaf is distributed according to $\Hyper(n^A_\tau,n_\tau,d_\tau)$. The same statistic results if $L^B$ is instead considered. (@alg-dt-fit) follows for fitting decision trees with the log-rank splitting rule, $SR$, to be maximised.

The higher the log-rank statistic, the greater the dissimilarity between the two groups, thereby making it a sensible splitting rule for survival, moreover it has been shown that it works well for splitting censored data  [@LeBlanc1993].\footnote{The results of this experiment are actually in LeBlanc's unpublished 1989 PhD thesis and therefore it has to be assumed that LeBlanc is accurately conveying its results in this 1993 paper.} When censoring is highly dependent on the outcome, the log-rank statistic does not perform well and is biased  [@Bland2004], which tends to be true of the majority of survival models. Additionally, the log-rank test requires no knowledge about the shape of the survival curves or distribution of the outcomes in either group  [@Bland2004], making it ideal for an automated process that requires no user intervention.

The log-rank *score* rule  [@Hothorn2003] is a standardized version of the log-rank rule that could be considered as a splitting rule, though simulation studies have demonstrated non-significant predictive performance when comparing the two  [@Ishwaran2008].

Alternative dissimiliarity measures and tests have also been suggested as splitting rules, including modified Kolmogorov-Smirnov test and Gehan-Wilcoxon tests  [@Ciampi1988]. Simulation studies have demonstrated that both of these may have higher power and produce 'better' results than the log-rank statistic  [@Fleming1980]. Despite this, these do not appear to be in common usage and no implementation could be found that include these.

\noindent #### Likelihood Based Rules {.unnumbered .unlisted}
Likelihood ratio statistics, or deviance based splitting rules, assume a certain model form and thereby an assumption about the data. This may be viewed as an advantageous strategy, as it could arguably increase interpretability, or a disadvantage as it places restrictions on the data. For survival models, a full-likelihood can be estimated with a Cox form by estimating the cumulative hazard function  [@LeBlanc1992]. LeBlanc and Crowley (1992)  [@LeBlanc1992] advocate for selecting the optimal split by maximising the full PH likelihood, assuming the cumulative hazard function, $H$, is known,
$$
\calL := \prod_{m = 1}^M \prod_{i \in L^m} h_m(T_i)^{\Delta_i} \exp(-H_m(T_i))
$$
where $M$ is the total number of terminal nodes, $h_m$ and $H_m$ are the (true) hazard and cumulative hazard functions in the $m$th node, and again $L^m$ is the set of observations in terminal node $m$. Estimation of $h_m$ and $H_m$ are described with the associated terminal node prediction below.

The primary advantage of this method is that any off-shelf regression software with a likelihood splitting rule can be utilised without any further adaptation to model fitting by supplying this likelihood with required estimates. However the additional costs of computing these estimates may outweigh the benefits once the likelihood has been calculated, and this could be why only one implementation of this method has been found [@Bou-Hamad2011; @pkgrpart].

#### Other Splitting Rules {.unnumbered .unlisted}

As well as likelihood and log-rank spitting rules, other papers have studied comparison of residuals  [@Therneau1990], scoring rules  [@pkgrfsrc], and distance metrics  [@Gordon1985]. These splitting rules work similarly to the mean squared error in the regression setting, in which the score should be minimised across both leaves. The choice of splitting rule is usually data-dependent and can be treated as a hyper-parameter for tuning. However if there is a clear goal in prediction, then the choice of splitting rule can be informed by the prediction type. For example, if the goal is to maximise separation, then a log-rank splitting rule to maximise homogeneity in terminal nodes is a natural starting point. Whereas if the goal is to estimate the linear predictor of a Cox PH model, then a likelihood splitting rule with a Cox form may be more sensible.

#### Terminal Node Prediction {#sec-surv-ml-models-ranfor-nodes}

Only two terminal node predictions appear in common usage.

#### Predict: Ranking {.unnumbered .unlisted}

Terminal node ranking predictions for survival trees and forests have been limited to those that use a likelihood-based splitting rule and assume a PH model form  [@Ishwaran2004; @LeBlanc1992]. In model fitting the likelihood splitting rule model attempts to fit the (theoretical) PH model $h_m(\tau) = h_0(\tau)\theta_m$ for $m \in 1,...,M$ where $M$ is the total number of terminal nodes and $\theta_m$ is a parameter to estimate. The model returns predictions for $\exp(\hat{\theta}_m)$ where $\hat{\theta}_m$ is the estimate of $\theta_m$. This is estimated via an iterative procedure in which in iteration $j+1$, $\hat{\theta}_m^{j+1}$ is estimated by
$$
\hat{\theta}_m^{j+1} = \frac{\sum_{i \in L^m} \Delta_i}{\sum_{i \in L^m} \hat{H}^j_0(T_i)}
$$
where as before $L^m$ is the set of observations in leaf $m$ and
$$
\hat{H}^{j}_0(\tau) = \frac{\sum_{i:T_i \leq \tau} \Delta_i}{\sum_{m = 1}^M\sum_{\{i:i \in \calR_\tau \cap L^a\}} \hat{\theta}^{j}_m}
$$
which is repeated until some stopping criterion is reached. The same cumulative hazard is estimated for all nodes however $\hat{\theta}_m$ varies across nodes. This method lends itself naturally to a composition to a full distribution (@sec-car) as it assumes a PH form and separately estimates the cumulative hazard and relative risk (@sec-surv-ml-models-ranfor-nov), though no implementation of this composition could be found.

#### Predict: Survival Distribution {.unnumbered .unlisted}

The most common terminal node prediction appears to be predicting the survival distribution by estimating the survival function, using the Kaplan-Meier or Nelson-Aalen estimators, on the sample in the terminal node  [@Hothorn2004; @Ishwaran2008; @LeBlanc1993; @Segal1988]. Estimating a survival function by a non-parametric estimator is a natural choice for terminal node prediction as these are natural 'baselines' in survival, similarly to taking the sample mean in regression. The prediction for SDTs is straightforward, the non-parametric estimator is fit on all observations in each of the terminal nodes. This is adapted to RSFs by bagging the estimator across all decision trees  [@Hothorn2004]. Using the Nelson-Aalen estimator as an example, let $m$ be a terminal node in an SDT, then the terminal node prediction is given by,
$$
\hat{H}_m(\tau) = \sum_{\{i: i \in L^m \cap T_i \leq \tau\}} \frac{d_i}{n_i}
$$ {#eq-surv-nelson}
where $d_i$ and $n_i$ are the number of events and observations at risk at time $T_i$ in terminal node $m$. Ishwaran  [@Ishwaran2008] defined the bootstrapped Nelson-Aalen estimator as
$$
\hat{H}_{Boot}(\tau) = \frac{1}{B} \sum^B_{b=1} \hat{H}_{m, b}(\tau), \quad m \in 1,...,M
$$ {#eq-surv-nelson-boot}
where $B$ is the total number of bootstrapped estimators, $M$ is the number of terminal nodes, and $\hat{H}_{m,b}$ is the cumulative hazard for the $m$th terminal node in the $b$th tree. The bootstrapped Kaplan-Meier estimator is calculated analogously. More generally these can be considered as a uniform mixture of $B$ distributions (@sec-car).

All implemented RSFs can now be summarised into the following five algorithms:

**RRT** {#mod-rrt}\\
LeBlanc and Crowley's (1992)  [@LeBlanc1992] survival decision tree uses a deviance splitting rule with a terminal node ranking prediction, which assumes a PH model form. These 'relative risk trees' (RRTs) are implemented in the package $\pkg{rpart}$  [@pkgrpart]. This model is considered the least accessible and transparent of all discussed in this section as: few implementations exist, it requires assumptions that may not be realistic, and predictions are harder to interpret than other models. Predictive performance of the model is expected to be worse than RSFs as this is a decision tree; this is confirmed in [@Sonabend2021b].

**RRF** {#mod-rrf}\\
Ishwaran $\etal$ (2004)  [@Ishwaran2004] proposed a random forest framework for the relative risk trees, which makes a slight adaptation and applies the iteration of the terminal node prediction after the tree is grown as opposed to during the growing process. No implementation for these 'relative risk forests' (RRFs) could be found or any usage in the literature.

**RSDF-DEV** {#mod-rsdfdev}\\
Hothorn $\etal$ (2004)  [@Hothorn2004] expanded upon the RRT by introducing a bagging composition thus creating a random forest with a deviance splitting rule, again assuming a PH form. However the ranking prediction is altered to be a bootstrapped Kaplan-Meier prediction in the terminal node. This is implemented in $\pkg{ipred}$  [@pkgipred]. This model improves upon the accessibility and transparency of the RRT by providing a more straightforward and interpretable terminal node prediction. However, as this is a decision tree, predictive performance is again expected to be worse than the RSFs.

**RSCIFF** {#mod-rsciff}\\
Hothorn $\etal$  [@Hothorn2005] studied a conditional inference framework in order to predict log-survival time. In this case the splitting rule is based on an IPC weighted loss function, which allows implementation by off-shelf classical random forests. The terminal node predictions are a weighted average of the log-survival times in the node where weighting is determined by the Kaplan-Meier estimate of the censoring distribution. This 'random survival conditional inference framework forest' (RSCIFF) is implemented in $\pkg{party}$  [@pkgparty] and $\pkg{partykit}$  [@pkgpartykit], which additionally includes a distribution terminal node prediction via the bootstrapped Kaplan-Meier estimator. The survival tree analogue (SDCIFT) is implemented in the same packages. Implementation of the RSCIFF is complex, which is likely why all implementations (in the above packages) are by the same authors. The complexity of conditional inference forests may also be the reason why several reviews, including this one, mention (or completely omit) RSCIFFs but do not include any comprehensive details that explain the fitting procedure [@Bou-Hamad2011; @Wang2017a]. In this regard, it is hard to claim that RSCIFFs are transparent or accessible. Moreover the authors of the model state that random conditional inference forests are for `'expert user[s] only and [their] current state is rather experimental''  [@pkgpartykit]. Finally with respect to model performance, there is evidence that they can outperform RSDFs (below) dependent on the data type  [@Nasejje2017] however no benchmark experiment could be found that compared them to other models.

**RSDF-STAT** {#mod-rsdfstat}\\
Finally Ishwaran $\etal$ (2008)  [@Ishwaran2008] proposed the most general form of RSFs with a choice of hypothesis tests (log-rank and log-rank score) and survival measure (Brier, concordance) splitting rules, and a bootstrapped Nelson-Aalen terminal node prediction. These are implemented in $\pkg{randomForestSRC}$  [@pkgrfsrc] and $\pkg{ranger}$  [@pkgranger]. There are several implementations of these models across programming languages, and extensive details for the fitting and predicting procedures, which makes them very accessible. The models utilise a standard random forest framework, which makes them transparent and familiar to those without expert Survival knowledge. Moreover they have been proven to perform well in benchmark experiments, especially on high-dimensional data  [@Herrmann2020; @Spooner2020].

### Conclusions

Random forests are a highly flexible algorithm that allow the various components to be adapted and altered without major changes to the underlying algorithm. The result is that relatively few $\Rstats$ implementations of RSFs cover almost half a century's worth of developments. The only algorithm that does not seem to be implemented is the relative risk forest.

A lack of accessibility, transparency, or proven performance makes RRT and RSDF-DEV a poor choice for model fitting. RSCIFF is potentially a powerful method with promising results in benchmark experiments, but even the authors recognise its complexity prevents it from being accessible. Ishwaran's RSFs on the other hand are more user-friendly and suitable for model fitting and deployment. Simulation studies have demonstrated that RSFs can perform well even with high levels of censoring and there is evidence that on some datasets these can outperform a Cox PH  [@Ishwaran2008].
Ishwaran's model is highly flexible, and its implementation in software packages reflects this. Therefore one can still confidently conclude that random forests are a powerful algorithm in regression, classification, and survival analysis.