Analysis pipeline for the precisionFDA Brain Cancer Predictive Modeling and Biomarker Discovery challenge using msaenet.
It ranked 2nd by predictive performance.
Team: Nan Xiao, Soner Koc, Kaushik Ghose from Seven Bridges.
Model
This solution combines the following components:
- Feature selection with the multi-step adaptive SCAD-net method (Xiao and Xu, 2015).
- Feature aggregation with a relaxed version of the "Stability Selection" procedure (Meinshausen and Bühlmann, 2010), which pools the features selected by 100 perturbed models and keeps only those selected consistently.
- Predictive modeling with gradient boosting decision tree (GBDT) models, using the selected genomic features and all four clinical features. The tree models include xgboost (Chen and Guestrin, 2016), lightgbm (Ke et al., 2017), catboost (Prokhorenkova et al., 2018), and a two-layer stacking tree model (Wolpert, 1992). After the challenge ended, we created an R package, stackgbm, for this modeling step.
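The aggregation step in the second item can be illustrated with a minimal sketch in base R. It is not the code from run.R: the helper name aggregate_selections, the toy gene names, and the 0.75 threshold are all assumptions made for illustration, and the sketch does not reproduce how the pipeline perturbs the data or what makes its procedure "relaxed". It only shows the general idea of keeping features whose selection frequency across models reaches a cutoff.

```r
# Hypothetical helper (not part of msaenet or run.R): keep the features
# that were selected in at least `threshold` of the fitted models.
aggregate_selections <- function(selections, threshold = 0.5) {
  # Count each feature at most once per model
  all_features <- unlist(lapply(selections, unique))
  counts <- table(all_features)
  # Selection frequency of each feature across all models
  freq <- as.numeric(counts) / length(selections)
  names(counts)[freq >= threshold]
}

# Toy input: features selected by four perturbed models (made-up names)
selections <- list(
  c("gene_a", "gene_b", "gene_c"),
  c("gene_a", "gene_c"),
  c("gene_a", "gene_d"),
  c("gene_a", "gene_c", "gene_d")
)

# Assumed cutoff: keep features selected in at least 75% of the models
stable <- aggregate_selections(selections, threshold = 0.75)
print(stable)
```

In the actual pipeline the list would hold the features selected by each of the 100 perturbed models, and the cutoff would be whatever run.R uses.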
Pipeline
Dependencies
Most of the required R packages can be installed from CRAN. Two need special handling:
- lightgbm: install from source. On macOS, we recommend compiling with a Homebrew gcc toolchain instead of the default LLVM toolchain.
- catboost: install the latest compiled binary package from the catboost GitHub releases.
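For the CRAN packages, a small helper can report which ones still need installing. This is a sketch under assumptions: find_missing_packages is a hypothetical helper, and the package vector below lists only msaenet and xgboost as examples; the full set of dependencies is whatever run.R loads.

```r
# Hypothetical helper: return the requested packages that are not yet
# installed. `installed` defaults to the packages in the current library.
find_missing_packages <- function(pkgs,
                                  installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

# Example CRAN dependencies (assumed, incomplete list)
cran_pkgs <- c("msaenet", "xgboost")

# Uncomment to install whatever is missing:
# install.packages(find_missing_packages(cran_pkgs))
```

lightgbm and catboost are left out on purpose, since they follow the separate installation routes described above.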
Reproducibility
Open run.R and follow the steps. Some steps can take a few hours to run even though they are fully parallelized.