/
meas_rules.qmd
243 lines (175 loc) · 19.5 KB
/
meas_rules.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
---
abstract: TODO (150-200 WORDS)
---
{{< include _setup.qmd >}}
# Evaluating Distributions by Scoring Rules {#sec-eval-distr}
{{< include _wip_minor.qmd >}}
Scoring rules evaluate probabilistic predictions and (attempt to) measure the overall predictive ability of a model in terms of both calibration and discrimination [@Gneiting2007; @Murphy1973].
In contrast to calibration measures, which assess the average performance across all observations on a population level, scoring rules evaluate the sample mean of individual predictions across all observations in a test set.
As well as being able to provide information at an individual level, scoring rules are also popular as probabilistic forecasts are widely recognised to be superior to deterministic predictions for capturing uncertainty in predictions [@Dawid1984; @Dawid1986].
Formalisation and development of scoring rules has primarily been due to Dawid [@Dawid1984; @Dawid1986; @Dawid2014] and Gneiting and Raftery [@Gneiting2007]; though the earliest measures promoting "rational" and "honest" decision making date back to the 1950s [@Brier1950; @Good1952].
Few scoring rules have been proposed in survival analysis, although the past few years have seen an increase in popularity in these measures.
Before delving into these measures, we will first describe scoring rules in the simpler classification setting.
Scoring rules are pointwise losses, which means a loss is calculated for all observations and the sample mean is taken as the final score.
To simplify notation, we only discuss scoring rules in the context of a single observation where $L_i(\hatS_i, t_i, \delta_i)$ would be the loss calculated for some observation $i$ where $\hatS_i$ is the predicted survival function (from which other distribution functions can be derived), and $(t_i, \delta_i)$ is the observed survival outcome.
## Classification Losses
In the simplest terms, a scoring rule compares two values and assigns them a score (hence 'scoring rule'), formally we'd write $L: \Reals \times \Reals \mapsto \ExtReals$.
In machine learning, this usually means comparing a prediction for an observation to the ground truth, so $L: \Reals \times \calP \mapsto \ExtReals$ where $\calP$ is a set of distributions.
Crucially, scoring rules usually refer to comparisons of true and predicted *distributions*.
As an example, take the Brier score [@Brier1950] defined by:
$$L_{Brier}(\hatp_i, y_i) = (y_i - \hatp_i(y_i))^2$$
This scoring rule compares the ground truth to the predicted probability distribution by testing if the difference between the observed event and the truth is minimized.
This is intuitive as if the event occurs and $y_i = 1$, then $\hatp_i(y_i)$ should be as close to one as possible to minimize the loss.
On the other hand, if $y_i = 0$ then the better prediction would be $\hatp_i(y_i) = 0$.
This demonstrates an important property of the scoring rule, *properness*.
A loss is *proper*, if it is minimized by the correct prediction.
In contrast, the loss $L_{improper}(\hatp_i, y_i) = 1 - L_{Brier}(\hatp_i, y_i)$ is still a scoring rule as it compares the ground truth to the prediction probability distribution, but it is clearly improper as the perfect prediction ($\hatp_i(y_i) = y_i$) would result in a score of $1$ whereas the worst prediction would result in a score or $0$.
Proper losses provide a method of model comparison as, by definition, predictions closest to the true distribution will result in lower expected losses.
Another important property is *strict* properness.
A loss is *strictly proper* if the loss is uniquely minimized by the 'correct' prediction.
Consider now the loss $L_0(\hatp_i, y_i) = 0$.
Not only is this a strictly proper scoring rule but it is also proper.
The loss can only take the value $0$ and is therefore guaranteed to be minimized by the correct prediction.
It is clear however that this loss is useless.
In contrast, the Brier score is minimized by only one value, which is the optimal prediction ([@fig-eval-brierlog]).
Strictly proper losses are particular important for automated model optimisation, as minimization of the loss will result in the 'optimum score estimator based on the scoring rule' [@Gneiting2007].
Mathematically, a classification loss $L: \calP \times \calY \rightarrow \ExtReals$ is *proper* if for any distributions $p_Y,p$ in $\calP$ and for any random variables $Y \sim p_Y$, it holds that $\EE[L(p_Y, Y)] \leq \EE[L(p, Y)]$.
The loss is *strictly proper* if, in addition, $p = p_Y$ uniquely minimizes the loss.
As well as the Brier score, which was defined above, another widely used loss is the log loss [@Good1952], defined by
$$
L_{logloss}(\hatp_i, y_i) = -\log \hat{p}_i(y_i)
$$
These losses are visualised in @fig-eval-brierlog, which highlights that both losses are strictly proper [@Dawid2014] as they are minimized when the true prediction is made, and converge to the minimum as predictions are increasingly improved.
![Brier and log loss scoring rules for a binary outcome and varying probabilistic predictions. x-axis is a probabilistic prediction in $[0,1]$, y-axis is Brier score (left) and log loss (right). Blue lines are varying Brier score/log loss over different predicted probabilities when the true outcome is 1. Red lines are varying Brier score/log loss over different predicted probabilities when the true outcome is 0. Both losses are minimized when $\hatp_i(y_i) = y_i$.](Figures/evaluation/brier_logloss.png){#fig-eval-brierlog fig-alt="TODO"}
## Survival Losses {#sec-eval-distr-commonsurv}
<!-- FIXME: COPY FINAL DEFINITION FROM PROPERNESS PAPER -->
Analogously to classification losses, a survival loss $L: \calP \times \PReals \times \bset \rightarrow \ExtReals$ is *proper* if for any distributions $p_Y, p$ in $\calP$, and for any random variables $Y \sim p_Y$, and $C$ t.v.i. $\PReals$; with $T := \min(Y,C)$ and $\Delta := \II(T=Y)$; it holds that, $\EE[L(p_Y, T, \Delta)] \leq \EE[L(p, T, \Delta)]$.
The loss is *strictly proper* if, in addition, $p = p_Y$ uniquely minimizes the loss. A survival loss is referred to as outcome-independent (strictly) proper if it is only (strictly) proper when $C$ and $Y$ are independent.
With these definitions, the rest of this chapter lists common scoring rules in survival analysis and discusses some of their properties.
As with other chapters, this list is likely not exhaustive but will cover commonly used losses.
### Integrated Graf Score
The Integrated Graf Score (IGS) was introduced by Graf [@Graf1995; @Graf1999] as an analogue to the integrated brier score (IBS) in regression.
It is likely the commonly used scoring rule in survival analysis, possibly due to its intuitive interpretation.
The loss is defined by
$$
\begin{split}
L_{IGS}(\hat{S}_i, t_i, \delta_i|\KMG) = \int^{\tau^*}_0 \frac{\hatS_i^2(\tau) \II(t_i \leq \tau, \delta_i=1)}{\KMG(t_i)} + \frac{\hatF_i^2(\tau) \II(t_i > \tau)}{\KMG(\tau)} \ d\tau
\end{split}
$$ {#eq-igs}
where $\hatS_i^2(\tau) = (\hatS_i(\tau))^2$ and $\hatF_i^2(\tau) = (1 - \hatS_i(\tau))^2$, and $\tau^* \in \NNReals$ is an upper threshold to compute the loss up to, and $\KMG$ is the Kaplan-Meier trained on the censoring distribution for IPCW (@sec-eval-crank-disc-conc).
At first glance this might seem intimidating but it is worth taking the time to understand the intuition behind the loss.
Recall the classification Brier score, $L(\hatp_i, y_i) = (y_i - \hatp_i(y))^2$, this provides a method to compare and evaluate a probability mass function at one time-point.
The *integrated Brier score* (IBS), also known as the *CRPS*, is the integral of the Brier score for all real-valued thresholds [@Gneiting2007] and hence allows predictions to be evaluated over multiple points as $L(\hatF_i, y_i) = \int (\hatF_i(y_i) - \II(y_i \geq x))^2 dy$ where $\hatF_i$ is the predicted cumulative distribution function and $x$ is some meaningful threshold.
In survival analysis, $\hatF_i(\tau)$ represents the probability of an observation having experienced the event at or before $\tau$, and the ground truth to compare to is therefore whether the observation has actually experienced the event at $\tau$, which is the case when $t_i \leq \tau$.
Hence the IBS becomes $L(\hatF_i, t_i) = \int (\hatF_i(\tau) - \II(t_i \leq \tau))^2 d\tau$.
Now for a given time $\tau$:
$$
L(\hatF_i, t) =
\begin{cases}
(\hatF_i(\tau) - 1)^2 = (1 - \hatF_i(\tau)^2) = \hatS_i^2(\tau), & \text{ if } t_i \leq \tau \\
(\hatF_i(\tau) - 0)^2 = \hatF_i^2(\tau), & \text{ if } t_i > \tau
\end{cases}
$$ {#eq-graf-cases}
In words, if an observation has not yet experienced an outcome ($t_i > \tau$) then the loss is minimized when the cumulative distribution function (the probability of having already died) is $0$, which is intuitive as the optimal prediction is correctly identifying the observation has yet to experience the event.
In contrast, if the observation has experienced the outcome ($t_i \leq \tau$) then the loss is minimized when the survival function (the probability of surviving until $\tau$) is $0$, which follows from similar logic.
The final component of the Graf score is accommodating for censoring.
At $\tau$ an observation will either have
1. Not experienced the event: $I(t_i > \tau)$;
2. Experienced the event: $I(t_i \leq \tau, \delta_i = 1)$; or
3. Been censored: $I(t_i \leq \tau, \delta_i = 0)$
In the Graf score, censored observations are discarded.
If they were not then @eq-graf-cases would imply their contribution would be treated the same as those who had experienced the event.
However this assumption would be entirely wrong as a censored observation is guaranteed not to have experienced the event, hence an ideal prediction for a censored observation is a high survival probability up until the point of censoring, at which time comparison to ground truth is unknown as this is no longer observed.
The act of discarding censored observations means that the sample size decreases over time.
To compensate for this, IPCW is used to increasingly weight predictions as $\tau$ increases.
Hence, IPCW weights, $W_i$ are defined such that
$$
W_i =
\begin{cases}
\KMG^{-1}(t_i), & \text{ if } \II(t_i \leq \tau, \delta_i = 1) \\
\KMG^{-1}(\tau), & \text{ if } \II(t_i > \tau)
\end{cases}
$$
The weights total $1$ when divided over all samples and summed [@Graf1999].
They are also intuitive as observations are either weighted by $G(\tau)$ when they are still alive and therefore still part of the sample, or by $G(t_i)$ otherwise.
As well as being intuitive, when censoring is uninformative, the Graf score consistently estimates the mean square error $L(t, S|\tau^*) = \int^{\tau^*}_0 [\II(t > \tau) - S(\tau)]^2 d\tau$, where $S$ is the correctly specified survival function [@Gerds2006].
However, despite these promising properties, the IGS is improper and must therefore be used with care [@Rindt2022; @Sonabend2022b].
In practice, experiments have shown that the effect of improperness is minimal and therefore this loss should be avoided for automated routines such as model tuning, however it can still be used for model evaluation.
<!-- CITE OUR PROPERNESS PAPER HERE -->
In addition, a small adaptation of the loss results in a strictly proper scoring rule simply by altering the weights such that $W_i = \KMG^{-1}(t_i)$ for all uncensored observations and $0$ otherwise [@Sonabend2022b], resulting in the reweighted Graf score:
<!-- CITE OUR PROPERNESS PAPER HERE -->
$$
L_{RGS}(\hatS_i, t_i, \delta_i|\KMG) = \delta_i\II(t_i \leq \tau^*) \int^{\tau^*}_0 \frac{(\II(t_i \leq \tau) - \hatF_i(\tau))^2}{\KMG(t_i)} \ d\tau
$$
The addition of $\II(t_i \leq \tau^*)$ completely removes observations that experience the event after the cutoff time, $\tau^*$, this ensures there are no cases where the $G(t_i)$ weighting is calculated on time after the cutoff.
Including an upper threshold (i.e, $\tau^* < \infty$) effects properness and generalization statements.
For example, by evaluating a model using the RGS with a $\tau^*$ threshold, then the model may be said to only perform well up until $\tau^*$ with its performance unknown after this time.
The change of weighting slightly alters the interpretation of the contributions at different time-points.
By example, let $(t_i = 4, t_j = 5)$ be two observed survival times, then at $\tau = 3$, the Graf score weighting would be $\KMG^{-1}(4)$ for both observations, whereas the RGS weights would be $(KMG^{-1}(4), KMG^{-1}(5))$ respectively, hence there is always more 'importance' placed on observations that take longer to experience the event.
In practice, the difference between these weights appears to be minimal [@Sonabend2022b] but as RGS is strictly proper, it is more suitable for automated experiments.
<!-- FIXME: CITAITON ABOVE TO PROPERNESS PAPER -->
### Integrated Survival Log Loss
<!-- FIXME - SHOULD WE JUST DELETE? -->
The integrated survival log loss (ISLL) was also proposed by @Graf1999.
$$
L_{ISLL}(\hatS_i,t_i,\delta_i|\KMG) = -\int^{\tau^*}_0 \frac{\log[\hatF_i(\tau)] \II(t_i \leq \tau, \delta_i=1)}{\KMG(t_i)} + \frac{\log[\hatS_i(\tau)] \II(t_i > \tau)}{\KMG(\tau)} \ d\tau
$$
where $\tau^* \in \PReals$ is an upper threshold to compute the loss up to.
Similarly to the IGS, there are three ways to contribute to the loss depending on whether an observation is censored, experienced the event, or alive, at $\tau$.
Whilst the IGS is routinely used in practice, there is no evidence that ISLL is used, and moreover there are no proofs (or claims) that it is proper.
The reweighted ISLL (RISLL) follows similarly to the RIGS and is also outcome-independent strictly proper [@Sonabend2022b].
$$
L_{RISLL}(\hatS_i, t_i, \delta_i|\KMG) = -\delta_i\II(t_i \leq \tau^*) \int^{\tau^*}_0 \frac{\II(t_i \leq \tau)\log[\hatF_i(\tau)] + \II(t_i > \tau)\log[\hatS_i(\tau)] \ d\tau}{\KMG(t_i)}
$$
### Survival density log loss
Another outcome-independent strictly proper scoring rule is the survival density log loss (SDLL) [@Sonabend2022b], which is given by
$$
L_{SDLL}(\hatf_i, t_i, \delta_i|\KMG) = - \frac{\delta_i \log[\hatf_i(t_i)]}{\KMG(t_i)}
$$
where $\hatf_i$ is the predicted probability density function.
This loss is essentially the classification log loss ($-\log(\hatp_i(t_i))$) with added IPCW.
Whilst the classification log loss has beneficial properties such as being differentiable, this is more complex for the SDLL and it is not widely used in practice.
A useful alternative to the SDLL which can be readily used in automated procedures is the right-censored log loss.
### Right-censored log loss
The right-censored log loss (RCLL) is an outcome-independent strictly proper scoring rule [@Avati2020] that benefits from not depending on IPCW in its construction.
The RCLL is defined by
$$
L_{RCLL}(\hatS_i, t_i, \delta_i) = -\log[\delta_i\hatf_i(t_i) + (1-\delta_i)\hatS_i(t_i)]
$$
This loss is interpretable when we break it down into its two halves:
1. If an observation is censored at $t_i$ then all the information we have is that they did not experience the event at the time, so they must be 'alive', hence the optimal value is $\hatS_i(t_i) = 1$ (which becomes $-log(1) = 0$).
2. If an observation experiences the event then the 'best' prediction is for the probability of the event at that time to be maximised, as pdfs are not upper-bounded this means $\hatf_i(t_i) = \infty$ (and $-log(t_i) \rightarrow \infty$ as $t_i \rightarrow \infty$).
### Absolute Survival Loss
The absolute survival loss, developed over time by @Schemper2000 and @Schmid2011, is based on the mean absolute error is very similar to the IGS but removes the squared term:
$$
L_{ASL}(\hatS_i, t_i, \delta_i|\KMG) = \int^{\tau^*}_0 \frac{\hatS_i(\tau)\II(t_i \leq \tau, \delta_i = 1)}{\KMG(t_i)} + \frac{\hatF_i(\tau)\II(t_i > \tau)}{\KMG(\tau)} \ d\tau
$$
where $\KMG$ and $\tau^*$ are as defined above.
Analogously to the IGS, the ASL score consistently estimates the mean absolute error when censoring is uninformative [@Schmid2011] but there are also no proofs or claims of properness.
The ASL and IGS tend to yield similar results [@Schmid2011] but in practice there is no evidence of the ASL being widely used.
<!-- FIXME - ADD FURTHER SCORING RULES HERE -->
## Prediction Error Curves {#sec-pecs}
<!-- FIXME - UNCHANGED FROM THESIS. I THINK FINE FOR NOW BUT NEEDS TO BE REWRITTEN IN THE FUTURE (MAINLY TO AVOID SELF PLAGIARISM) -->
As well as evaluating probabilistic outcomes with integrated scoring rules, non-integrated scoring rules can be utilised for evaluating distributions at a single point.
For example, instead of evaluating a probabilistic prediction with the IGS over $\NNReals$, instead one could compute the IGS at a single time-point, $\tau \in \NNReals$, only.
Plotting these for varying values of $\tau$ results in 'prediction error curves' (PECs), which provide a simple visualisation for how predictions vary over the outcome.
PECs are especially useful for survival predictions as they can visualise the prediction 'over time'.
PECs are mostly used as a graphical guide when comparing few models, rather than as a formal tool for model comparison.
An example for PECs is provided in @fig-eval-pecs for the IGS where the the Cox PH consistently outperforms the SVM.
![Prediction error curves for the CPH and SVM models from @sec-eval-distr-calib. x-axis is time and y-axis is the IGS computed at different time-points. The CPH (red) performs better than the SVM (blue) as it scores consistently lower. Trained and tested on randomly simulated data from $\proba$.](Figures/evaluation/pecs.png){#fig-eval-pecs fig-alt="TODO"}
## Baselines and ERV {#sec-eval-distr-score-base}
A common criticism of scoring rules is a lack of interpretability, for example, an IGS of 0.5 or 0.0005 has no meaning by itself, so below we present two methods to help overcome this problem.
The first method, is to make use of baselines for model comparison, which are models or values that can be utilised to provide a reference for a loss, they provide a universal method to judge all models of the same class by [@Gressmann2018].
In classification, it is possible to derive analytical baseline values, for example a Brier score is considered 'good' if it is below 0.25 or a log loss if it is below 0.693 (@fig-eval-brierlog), this is because these are the values obtained if you always predicted probabilties as $0.5$, which is a reasonable basline guess in a binary classificaiton problem.
In survival analysis, simple analytical expressions are not possible as losses are dependent on the unknown distributions of both the survival and censoring time.
Therefore all experiments in survival analysis must include a baseline model that can produce a reference value in order to derive meaningful results.
A suitable baseline model is the Kaplan-Meier estimator [@Graf1995; @Lawless2010; @Royston2013], which is the simplest model that can consistently estimate the true survival function.
As well as directly comparing losses from a 'sophisticated' model to a baseline, one can also compute the percentage increase in performance between the sophisicated and baseline models, which produces a measure of explained residual variation (ERV) [@Korn1990; @Korn1991].
For any survival loss $L$, the ERV is,
$$
R_L(S, B) = 1 - \frac{L|S}{L|B}
$$
where $L|S$ and $L|B$ is the loss computed with respect to predictions from the sophisticated and baseline models respectively.
The ERV interpretation makes reporting of scoring rules easier within and between experiments.
For example, say in experiment A we have $L|S = 0.004$ and $L|B = 0.006$, and in experiment B we have $L|S = 4$ and $L|B = 6$.
The sophisticated model may appear worse at first glance in experiment A (as the losses are very close) but when considering the ERV we see that the performance increase is identical (both $R_L = 33\%$), thus providing a clearer way to compare models.