This repository contains the software and data used to create a classifier for invasive bacterial infections based on clinical characteristics. There are three main parts to this process:
- Training the classifier
- Validating the classifier
- Creating honest confidence intervals
If you prefer a PDF-rendered version of this README, please see the PDF version.
Our classifier is a super learner ensemble, based on the SuperLearner package in R. The super learner uses a library of learners that can be trained on the data. We train each learner in the library and combine them into an ensemble, so that a prediction using all of the learners together performs at least as well as any individual learner. This makes the training method more robust to modeling assumptions and improves the generalizability of the classifier.
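As an illustration of the stacking idea behind the super learner (the analysis itself uses the SuperLearner package in R), here is a minimal, hypothetical Python/NumPy sketch: cross-validated predictions from a toy library of two learners are combined by choosing the convex-combination weight that minimizes the cross-validated squared error. The toy data and both learners are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: binary outcome driven by one covariate (illustrative only).
n = 200
x = rng.normal(size=n)
y = (x + rng.normal(scale=1.0, size=n) > 0).astype(float)

# Two simple base learners (stand-ins for a real learner library):
# 1) a "global mean" learner, 2) a crude fixed sigmoid learner.
def learner_mean(x_tr, y_tr, x_te):
    return np.full(len(x_te), y_tr.mean())

def learner_sigmoid(x_tr, y_tr, x_te):
    return 1.0 / (1.0 + np.exp(-2.0 * x_te))

learners = [learner_mean, learner_sigmoid]

# Cross-validated predictions for each learner, using 5 folds as in
# the inner layer of the analysis.
folds = np.array_split(rng.permutation(n), 5)
cv_preds = np.zeros((n, len(learners)))
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    for j, fit in enumerate(learners):
        cv_preds[test_idx, j] = fit(x[train_idx], y[train_idx], x[test_idx])

# Meta-learning step: pick the convex combination of the learners that
# minimizes cross-validated squared error (grid search over alpha).
alphas = np.linspace(0, 1, 101)
losses = [np.mean((y - (a * cv_preds[:, 0] + (1 - a) * cv_preds[:, 1])) ** 2)
          for a in alphas]
best_alpha = alphas[int(np.argmin(losses))]

ensemble_loss = min(losses)
single_losses = [np.mean((y - cv_preds[:, j]) ** 2) for j in range(2)]
print(best_alpha, ensemble_loss)
```

Because the grid of combination weights includes the endpoints 0 and 1, the ensemble's cross-validated loss can never exceed that of the best single learner, which is the sense in which the combined prediction is at least as good as any individual one.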
The invasive bacterial infection (IBI) outcome is nested under the bacterial infection (BI) outcome, in the sense that all IBI-positive infants are BI-positive, but not all BI-positive infants are IBI-positive. This poses a challenge for training the classifier, since there are very few IBI-positive infants overall. To improve the training process, we used BI outcomes as a surrogate for the IBI outcomes, since the data contain many more examples of BI-positive infants to work with. To ensure that the classifier actually targeted the IBI outcome rather than the BI outcome, we used observation weights to place more importance on the IBI-positive infants.
Let $Y$ be the IBI outcome we want to predict, $X$ the covariates to classify with, and $Z$ the BI surrogate outcome. We used observation weights $w_i$, which were created to up-weight infants with $Y_i = 1$. If a classifier $f$ would usually minimize an empirical risk of the form

$$\frac{1}{n} \sum_{i=1}^{n} L(Z_i, f(X_i)),$$

then our observation weights targeted the slightly different function

$$\frac{1}{n} \sum_{i=1}^{n} w_i \, L(Z_i, f(X_i)).$$

The weights were created to up-weight the importance of $\{Y_i = 1\}$ and downweight the occurrence of $\{Z_i = 1, Y_i = 0\}$. This allows the weighted classifier to target the occurrence of $Y = 1$ rather than $Z = 1$, even though $Z$ is used in the training process.
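To make the weighting scheme concrete, here is a small hypothetical sketch (in Python/NumPy rather than R, with made-up toy outcomes) of the weighted empirical risk: the log-loss is still computed against the BI surrogate outcome, but IBI-positive infants receive 8-fold weight, matching the weight configuration ultimately selected below.

```python
import numpy as np

# Toy illustration of the weighting scheme (values are invented):
# y = IBI outcome, z = BI surrogate outcome (IBI is nested in BI).
y = np.array([1, 1, 0, 0, 0, 0, 0, 0])   # few IBI-positive infants
z = np.array([1, 1, 1, 1, 1, 0, 0, 0])   # more BI-positive infants

# Weights of the selected form: 8 when IBI-positive, 1 otherwise.
w = np.where(y == 1, 8.0, 1.0)

def weighted_risk(outcome, preds, weights):
    # Weighted empirical log-loss: (1/n) * sum_i w_i * L(outcome_i, preds_i)
    eps = 1e-12
    p = np.clip(preds, eps, 1 - eps)
    loss = -(outcome * np.log(p) + (1 - outcome) * np.log(1 - p))
    return np.mean(weights * loss)

# An uninformative classifier predicting 0.5 for every infant.
preds = np.full(len(z), 0.5)
unweighted = weighted_risk(z, preds, np.ones_like(w))  # usual risk on z
weighted = weighted_risk(z, preds, w)                  # weighted risk
print(unweighted, weighted)
```

The loss is evaluated on the surrogate `z` in both cases; only the weights, which depend on `y`, differ, so errors on IBI-positive infants dominate the weighted criterion.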
We used cross-validation to estimate the generalization performance of
the super learner classifier. Since super learning itself utilizes a
layer of cross-validation, this means that we used, in effect, a 2-layer
nested cross-validation procedure: the outer layer (examining the
performance of the super learner) used 10 folds for cross-validation
while the inner layer (used to find an optimal combination of the
learner library) used 5 folds. Each layer optimized the weighted
criterion described above. We picked the weights from among 7 different
possibilities on the basis of the cross-validated AUC for predicting
IBI. I.e., we trained 7 different versions of the CV.SuperLearner
with
different weights and picked the one offering the best AUC performance
when applied to the IBI outcome. All of these super learners used the
same cross-validation folds and library of learners. The weights we chose
in practice took a value of 8 for IBI-positive infants and 1 otherwise.
Effectively, this up-weighted IBI-positive infants 8-fold in importance
compared to other infants. This configuration provided the best AUC
performance; AUC is a balanced measure of performance across the
IBI-positive and IBI-negative infants.
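The weight-selection step can be sketched as follows. This is a hypothetical Python/NumPy illustration, not the analysis code (the real analysis uses R's CV.SuperLearner): it generates stand-in cross-validated predictions for seven weight configurations and picks the configuration whose predictions maximize a rank-based AUC against the IBI outcome.

```python
import numpy as np

rng = np.random.default_rng(1)

def auc(y_true, scores):
    # Rank-based AUC: probability a positive case outranks a negative one,
    # counting ties as half.
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    return ((pos[:, None] > neg[None, :]).mean()
            + 0.5 * (pos[:, None] == neg[None, :]).mean())

# Hypothetical stand-ins for the cross-validated super learner
# predictions under each of the 7 weight configurations.
n = 300
y_ibi = (rng.random(n) < 0.1).astype(int)
cv_preds = {k: np.clip(0.1 * y_ibi * k + rng.random(n) * 0.5, 0, 1)
            for k in range(1, 8)}

# Select the configuration whose CV predictions maximize AUC for IBI.
aucs = {k: auc(y_ibi, p) for k, p in cv_preds.items()}
best_k = max(aucs, key=aucs.get)
print(best_k, round(aucs[best_k], 3))
```

Because every configuration is scored against the same IBI outcome and the same cross-validation folds, the AUC comparison across configurations is a fair one.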
Since we didn't use another layer of cross-validation to encapsulate the selection of the weights, we used the bootstrap bias-corrected cross-validation procedure to provide inference.
Nested cross-validation provided protection from overfitting the
classifier. However, we did not wrap the weight selection in yet
another layer of cross-validation. That would have been very troublesome
for the analysis, as there were already $10 \times 5 = 50$
different cross-validation folds being created and very few outcomes
that could be used for training in each fold. To get around this, we
used the bootstrap bias-corrected CV procedure. Here is a rough outline
of how this process works:
- Start with the cross-validated super learner predictions
  corresponding to one of the seven weight configurations. Call the
  predictions $\hat{p}_i^{(k)}$ for $i = 1, \dots, n$ and weight
  configuration $k = 1, \dots, 7$.
- Use the bootstrap to resample the row indices $\{1, \dots, n\}$. Let
  $B_b$ be the $b$-th resample of the indices.
- For $b = 1, \dots, B$:
  - The resampled outcomes and predictions
    $\{(Y_i, \hat{p}_i^{(k)}) : i \in B_b\}$ can be used to select the
    AUC-optimal weight configuration $\hat{k}_b$.
  - Use the out-of-bootstrap sample to estimate performance on the
    selected configuration: $\{(Y_i, \hat{p}_i^{(\hat{k}_b)}) : i \in B_b^c\}$,
    for $B_b^c$ the set complement of $B_b$. We apply another layer of
    bootstrap here, resampling among the indices in $B_b^c$, to provide
    confidence intervals. Call the resampled indices $\tilde{B}_b$.
  - Let $T$ be one of the test characteristics we want to provide
    inference for, e.g. the sensitivity or specificity for some
    predictions $\hat{p}$ and true class values $Y$. In the $b$-th
    iteration, return $T_b = T\bigl(\{(Y_i, \hat{p}_i^{(\hat{k}_b)}) : i \in \tilde{B}_b\}\bigr)$.
- Use the percentile bootstrap method to create confidence intervals
  for the test statistic $T$: i.e., use quantiles of the 50,000 values
  of $T_b$ to build confidence intervals.
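The outline above can be sketched in code. This is a hypothetical Python/NumPy illustration rather than the analysis code: it uses simulated outcomes and stand-in CV predictions, far fewer bootstrap iterations than the 50,000 draws used in practice, and sensitivity at a fixed 0.5 threshold as the example test characteristic.

```python
import numpy as np

rng = np.random.default_rng(2)

def auc(y_true, scores):
    # Rank-based AUC, counting ties as half.
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    return ((pos[:, None] > neg[None, :]).mean()
            + 0.5 * (pos[:, None] == neg[None, :]).mean())

def sensitivity(y_true, scores, thresh=0.5):
    # Fraction of true positives classified as positive at the threshold.
    return (scores >= thresh)[y_true == 1].mean()

# Hypothetical stand-ins for the cross-validated super learner
# predictions under each of the 7 weight configurations.
n = 400
y = (rng.random(n) < 0.15).astype(int)
cv_preds = {k: np.clip(0.4 * y + rng.random(n) * 0.6, 0, 1)
            for k in range(1, 8)}

B = 500  # the actual analysis used many more draws (50,000 in total)
stats = []
idx = np.arange(n)
for _ in range(B):
    boot = rng.choice(idx, size=n, replace=True)   # in-bag resample B_b
    oob = np.setdiff1d(idx, boot)                  # out-of-bag complement
    if y[boot].sum() == 0 or y[oob].sum() == 0:
        continue  # skip degenerate resamples with no positives
    # Select the AUC-optimal configuration on the in-bag sample.
    k_hat = max(cv_preds, key=lambda k: auc(y[boot], cv_preds[k][boot]))
    # Second bootstrap layer: resample the out-of-bag rows for the CI.
    oob2 = rng.choice(oob, size=len(oob), replace=True)
    if y[oob2].sum() == 0:
        continue
    stats.append(sensitivity(y[oob2], cv_preds[k_hat][oob2]))

# Percentile-bootstrap 95% confidence interval for the sensitivity.
lo, hi = np.percentile(stats, [2.5, 97.5])
print(round(lo, 3), round(hi, 3))
```

Because the configuration is re-selected inside every bootstrap iteration and evaluated only on out-of-bootstrap rows, the resulting interval accounts for the selection step without requiring an additional layer of cross-validation.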