This repository contains the software and data used to create a classifier for invasive bacterial infections based on clinical characteristics. There are three main parts to this process:
- Training the classifier
- Validating the classifier
- Creating honest confidence intervals
If you prefer a PDF-rendered version of this README, please see the PDF version.
Our classifier is a super learner ensemble, based on the SuperLearner package in R. The super learner uses a library of learners that can be trained on the data. We train each learner in the library and combine them into an ensemble, so that a prediction using all of the learners together performs at least as well as any individual learner. This makes the training method more robust to modeling assumptions and improves the generalizability of the classifier.
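As an illustration of the stacking idea behind the super learner (the analysis itself uses the SuperLearner package in R), here is a minimal, hypothetical Python/NumPy sketch: cross-validated predictions from a toy library of two learners are combined by choosing the convex-combination weight that minimizes the cross-validated squared error. The toy data and both learners are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: binary outcome driven by one covariate (illustrative only).
n = 200
x = rng.normal(size=n)
y = (x + rng.normal(scale=1.0, size=n) > 0).astype(float)

# Two simple base learners (stand-ins for a real learner library):
# 1) a "global mean" learner, 2) a crude fixed sigmoid learner.
def learner_mean(x_tr, y_tr, x_te):
    return np.full(len(x_te), y_tr.mean())

def learner_sigmoid(x_tr, y_tr, x_te):
    return 1.0 / (1.0 + np.exp(-2.0 * x_te))

learners = [learner_mean, learner_sigmoid]

# Cross-validated predictions for each learner, using 5 folds as in
# the inner layer of the analysis.
folds = np.array_split(rng.permutation(n), 5)
cv_preds = np.zeros((n, len(learners)))
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    for j, fit in enumerate(learners):
        cv_preds[test_idx, j] = fit(x[train_idx], y[train_idx], x[test_idx])

# Meta-learning step: pick the convex combination of the learners that
# minimizes cross-validated squared error (grid search over alpha).
alphas = np.linspace(0, 1, 101)
losses = [np.mean((y - (a * cv_preds[:, 0] + (1 - a) * cv_preds[:, 1])) ** 2)
          for a in alphas]
best_alpha = alphas[int(np.argmin(losses))]

ensemble_loss = min(losses)
single_losses = [np.mean((y - cv_preds[:, j]) ** 2) for j in range(2)]
print(best_alpha, ensemble_loss)
```

Because the grid of combination weights includes the endpoints 0 and 1, the ensemble's cross-validated loss can never exceed that of the best single learner, which is the sense in which the combined prediction is at least as good as any individual one.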
The invasive bacterial infection (IBI) outcome is nested under the bacterial infection (BI) outcome, in the sense that all IBI-positive infants are BI-positive, but not all BI-positive infants are IBI-positive. This poses a challenge for training the classifier, since there are very few IBI-positive infants overall. To improve the training process, we used BI outcomes as a surrogate for the IBI outcomes, since the data contain many more examples of BI-positive infants to work with. To ensure that the classifier actually targeted the IBI outcome rather than the BI outcome, we used observation weights to place more importance on the IBI-positive infants.
Let $Y$ be the IBI outcome we want to predict, $X$ the covariates to classify with, and $Z$ the BI surrogate outcome. We used observation weights $w_i$, which were created to up-weight infants with $Y_i = 1$. If a classifier $f$ would usually minimize an empirical risk of the form

$$\frac{1}{n} \sum_{i=1}^{n} L(Z_i, f(X_i)),$$

then our observation weights targeted the slightly different function

$$\frac{1}{n} \sum_{i=1}^{n} w_i \, L(Z_i, f(X_i)).$$

The weights were created to up-weight the importance of $\{Y_i = 1\}$ and downweight the occurrence of $\{Z_i = 1, Y_i = 0\}$. This allows the weighted classifier to target the occurrence of $Y = 1$ rather than $Z = 1$, even though $Z$ is used in the training process.
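To make the weighting scheme concrete, here is a small hypothetical sketch (in Python/NumPy rather than R, with made-up toy outcomes) of the weighted empirical risk: the log-loss is still computed against the BI surrogate outcome, but IBI-positive infants receive 8-fold weight, matching the weight configuration ultimately selected below.

```python
import numpy as np

# Toy illustration of the weighting scheme (values are invented):
# y = IBI outcome, z = BI surrogate outcome (IBI is nested in BI).
y = np.array([1, 1, 0, 0, 0, 0, 0, 0])   # few IBI-positive infants
z = np.array([1, 1, 1, 1, 1, 0, 0, 0])   # more BI-positive infants

# Weights of the selected form: 8 when IBI-positive, 1 otherwise.
w = np.where(y == 1, 8.0, 1.0)

def weighted_risk(outcome, preds, weights):
    # Weighted empirical log-loss: (1/n) * sum_i w_i * L(outcome_i, preds_i)
    eps = 1e-12
    p = np.clip(preds, eps, 1 - eps)
    loss = -(outcome * np.log(p) + (1 - outcome) * np.log(1 - p))
    return np.mean(weights * loss)

# An uninformative classifier predicting 0.5 for every infant.
preds = np.full(len(z), 0.5)
unweighted = weighted_risk(z, preds, np.ones_like(w))  # usual risk on z
weighted = weighted_risk(z, preds, w)                  # weighted risk
print(unweighted, weighted)
```

The loss is evaluated on the surrogate `z` in both cases; only the weights, which depend on `y`, differ, so errors on IBI-positive infants dominate the weighted criterion.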
We used cross-validation to estimate the generalization performance of
the super learner classifier. Since super learning itself utilizes a
layer of cross-validation, this means that we used, in effect, a 2-layer
nested cross-validation procedure: the outer layer (examining the
performance of the super learner) used 10 folds for cross-validation
while the inner layer (used to find an optimal combination of the
learner library) used 5 folds. Each layer optimized the weighted
criterion described above. We picked the weights from among 7 different
possibilities on the basis of the cross-validated AUC for predicting
IBI. I.e., we trained 7 different versions of the CV.SuperLearner
with
different weights and picked the one offering the best AUC performance
when applied to the IBI outcome. All of these super learners used the
same cross-validation folds and library of learners. The weights we chose
in practice took a value of 8 for IBI-positive infants and 1 otherwise.
Effectively, this up-weighted IBI-positive infants 8-fold in importance
compared to other infants. This configuration provided the best AUC
performance; AUC is a balanced measure of performance across the
IBI-positive and IBI-negative infants.
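The weight-selection step can be sketched as follows. This is a hypothetical Python/NumPy illustration, not the analysis code (the real analysis uses R's CV.SuperLearner): it generates stand-in cross-validated predictions for seven weight configurations and picks the configuration whose predictions maximize a rank-based AUC against the IBI outcome.

```python
import numpy as np

rng = np.random.default_rng(1)

def auc(y_true, scores):
    # Rank-based AUC: probability a positive case outranks a negative one,
    # counting ties as half.
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    return ((pos[:, None] > neg[None, :]).mean()
            + 0.5 * (pos[:, None] == neg[None, :]).mean())

# Hypothetical stand-ins for the cross-validated super learner
# predictions under each of the 7 weight configurations.
n = 300
y_ibi = (rng.random(n) < 0.1).astype(int)
cv_preds = {k: np.clip(0.1 * y_ibi * k + rng.random(n) * 0.5, 0, 1)
            for k in range(1, 8)}

# Select the configuration whose CV predictions maximize AUC for IBI.
aucs = {k: auc(y_ibi, p) for k, p in cv_preds.items()}
best_k = max(aucs, key=aucs.get)
print(best_k, round(aucs[best_k], 3))
```

Because every configuration is scored against the same IBI outcome and the same cross-validation folds, the AUC comparison across configurations is a fair one.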
Since we didn't use another layer of cross-validation to encapsulate the selection of the weights, we used the bootstrap bias-corrected cross-validation procedure to provide inference.
Nested cross-validation provided protection from overfitting the
classifier. However, we did not wrap the weight selection in yet
another layer of cross-validation. That would have been very troublesome
for the analysis, as there were already $10 \times 5 = 50$
different cross-validation folds being created and very few outcomes
that could be used for training in each fold. To get around this, we
used the bootstrap bias-corrected CV procedure. Here is a rough outline
of how this process works:
- Start with the cross-validated super learner predictions
  corresponding to one of the seven weight configurations. Call the
  predictions $\hat{p}_i^{(k)}$ for $i = 1, \dots, n$ and weight
  configuration $k = 1, \dots, 7$.
- Use the bootstrap to resample the row indices $\{1, \dots, n\}$. Let
  $B_b$ be the $b$-th resample of the indices.
- For $b = 1, \dots, B$:
  - The resampled outcomes and predictions
    $\{(Y_i, \hat{p}_i^{(k)}) : i \in B_b\}$ can be used to select the
    AUC-optimal weight configuration $\hat{k}_b$.
  - Use the out-of-bootstrap sample to estimate performance on the
    selected configuration: $\{(Y_i, \hat{p}_i^{(\hat{k}_b)}) : i \in B_b^c\}$,
    for $B_b^c$ the set complement of $B_b$. We apply another layer of
    bootstrap here, resampling among the indices in $B_b^c$, to provide
    confidence intervals. Call the resampled indices $\tilde{B}_b$.
  - Let $T$ be one of the test characteristics we want to provide
    inference for, e.g. the sensitivity or specificity for some
    predictions $\hat{p}$ and true class values $Y$. In the $b$-th
    iteration, return $T_b = T\bigl(\{(Y_i, \hat{p}_i^{(\hat{k}_b)}) : i \in \tilde{B}_b\}\bigr)$.
- Use the percentile bootstrap method to create confidence intervals
  for the test statistic $T$: i.e., use quantiles of the 50,000 values
  of $T_b$ to build confidence intervals.
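The outline above can be sketched in code. This is a hypothetical Python/NumPy illustration rather than the analysis code: it uses simulated outcomes and stand-in CV predictions, far fewer bootstrap iterations than the 50,000 draws used in practice, and sensitivity at a fixed 0.5 threshold as the example test characteristic.

```python
import numpy as np

rng = np.random.default_rng(2)

def auc(y_true, scores):
    # Rank-based AUC, counting ties as half.
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    return ((pos[:, None] > neg[None, :]).mean()
            + 0.5 * (pos[:, None] == neg[None, :]).mean())

def sensitivity(y_true, scores, thresh=0.5):
    # Fraction of true positives classified as positive at the threshold.
    return (scores >= thresh)[y_true == 1].mean()

# Hypothetical stand-ins for the cross-validated super learner
# predictions under each of the 7 weight configurations.
n = 400
y = (rng.random(n) < 0.15).astype(int)
cv_preds = {k: np.clip(0.4 * y + rng.random(n) * 0.6, 0, 1)
            for k in range(1, 8)}

B = 500  # the actual analysis used many more draws (50,000 in total)
stats = []
idx = np.arange(n)
for _ in range(B):
    boot = rng.choice(idx, size=n, replace=True)   # in-bag resample B_b
    oob = np.setdiff1d(idx, boot)                  # out-of-bag complement
    if y[boot].sum() == 0 or y[oob].sum() == 0:
        continue  # skip degenerate resamples with no positives
    # Select the AUC-optimal configuration on the in-bag sample.
    k_hat = max(cv_preds, key=lambda k: auc(y[boot], cv_preds[k][boot]))
    # Second bootstrap layer: resample the out-of-bag rows for the CI.
    oob2 = rng.choice(oob, size=len(oob), replace=True)
    if y[oob2].sum() == 0:
        continue
    stats.append(sensitivity(y[oob2], cv_preds[k_hat][oob2]))

# Percentile-bootstrap 95% confidence interval for the sensitivity.
lo, hi = np.percentile(stats, [2.5, 97.5])
print(round(lo, 3), round(hi, 3))
```

Because the configuration is re-selected inside every bootstrap iteration and evaluated only on out-of-bootstrap rows, the resulting interval accounts for the selection step without requiring an additional layer of cross-validation.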