MM-challenge

Scripts used for Multiple Myeloma Challenge

Introduction

Multiple myeloma (MM) is a type of blood cancer that affects the plasma cells. In multiple myeloma, malignant plasma cells accumulate in the bone marrow, crowding out the normal plasma cells that help fight infections. These malignant plasma cells then produce abnormal proteins (m protein) which damage the kidneys and interfere with biological processes throughout the body. The malignant plasma cells also increase bone turnover, leaving bones weakened and releasing toxic levels of calcium into the bloodstream.

MM remains a challenging cancer to treat, particularly for a subset of patients with aggressive disease. Despite advances in our understanding of the molecular biology of the disease, identifying patients at high risk of rapid progression remains difficult. Between 15%-25% of multiple myeloma patients will progress (or die) within 18 months of diagnosis regardless of treatment. While risk-adapted therapy is becoming incorporated into treatment protocols, such patients are difficult to confidently identify prior to progression. Therefore, a precise risk stratification model is critical to assist in therapeutic decision-making.

In collaboration with Myeloma Genome Project (MGP), clinical variables, patient outcomes, genetic, and gene expression data from thousands of samples from private and public studies have been curated and harmonized for this challenge. Used stacking to create an ensemble of several baseline classifiers for constructing a precise risk stratification model to identify high risk Multiple Myeloma patients using gene expression variables with ISS and age.

Method:

Applied the same pipeline on each of the four data sets. the four training data sets: GSE24080UAMS, HOVON65, EMTAB and MMRF,

Resampling:

Since the data sets were unbalanced in terms of the number of high risk versus the number of low risk subjects, we used an algorithm that creates datasets with balanced distributions by combining oversampling by SMOTE and undersampling by Tomek.

Feature Selection:

The size of the dataset was reduced by filtering out genes that were not differentially expressed. The R Bioconductor limma package was used for assessing differential gene expression and returned a ranking of the genes. To further perform feature selection, we utilized SVM-RFECV and MRMR approaches [Guyon et al., 2002]. To apply the first method, a support vector machine (SVM) classifier with a linear kernel was trained using all the features. Then, recursive feature elimination (RFE) was applied by using the weight magnitude as a ranking criterion. Smaller subsets (for backward feature elimination) of features were considered by removing lowest ranked features first and cross-validation (CV) was used to determine the optimal number of features. The idea of the second method, Minimum Redundancy and Maximum Relevance (MRMR), is to select a subset of genes so that each gene in the subset has the highest similarity with target vector (maximum relevance), while simultaneously having the lowest similarity with the rest of the genes in the subset (minimum redundancy). Mutual information was used as a measure of similarity. Note that each training data set may have different set of relevant/redundant genes.

First step Classfiers used to Predict the Group:

In order to figure out the right model to use, we need to first decide which group a given test data sample belongs to. We therefore trained classifiers to perform the related tasks with four training sets, GSE24080UAMS, HOVON65, EMTAB and MMRF, respectively.

Second Step Classifiers used to predict the risk level:

The baseline classifiers included Support Vector Machine (SVM), Neural Network (NN), Random Forest (RF), Gradient Boosting Machine (GBM), Learning Vector Quantization (LVQ) and Generalized Linear Model (GLM). The parameters of all baseline models were optimized with cross validation. Then, use a neural network to ensemble above mentioned baseline classifiers.

Test Submission Period:

Extra Step: Fill in missing data in the test set by imputing with averaging.
After that, classify the test samples into the right group by using the first step classifier mentioned above. Depending on the classifed group, apply the appropriate pretrained models to predict its risk level.

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
MM_submission_code		MM_submission_code
output		output
test-data		test-data
Data Preparation.R		Data Preparation.R
Dockerfile		Dockerfile
EMC92_UAMS70_ENSEM.R		EMC92_UAMS70_ENSEM.R
EMTAB_ENSEMBLE.R		EMTAB_ENSEMBLE.R
EMTA_Classification.R		EMTA_Classification.R
HOVON_Classification.R		HOVON_Classification.R
HOVON_ENSEMBLE.R		HOVON_ENSEMBLE.R
Joaquin_feature_selected_data.R		Joaquin_feature_selected_data.R
MMRF_DNA.R		MMRF_DNA.R
MMRF_ENSEMBLE.R		MMRF_ENSEMBLE.R
MMRF_ISS.R		MMRF_ISS.R
MM_EDA.ipynb		MM_EDA.ipynb
MM_R.ipynb		MM_R.ipynb
MM_clustering.R		MM_clustering.R
MM_html.nb.html		MM_html.nb.html
Missing data.R		Missing data.R
Naive Bayes.R		Naive Bayes.R
README.md		README.md
RNA_seq.R		RNA_seq.R
RNA_seq_test_global.R		RNA_seq_test_global.R
SMOTE-Tomek_Age_ISS_genes.py		SMOTE-Tomek_Age_ISS_genes.py
SVM_REFV_MRMR_ENSEMBLE.R		SVM_REFV_MRMR_ENSEMBLE.R
UAMS_Classification.R		UAMS_Classification.R
UAMS_ENSEMBLE.R		UAMS_ENSEMBLE.R
UAMS_SURVIVAL.R		UAMS_SURVIVAL.R
UAMS_regression.R		UAMS_regression.R
dataset_pred_model.R		dataset_pred_model.R
global_ensemble_model.R		global_ensemble_model.R
metrics.R		metrics.R
original_test_ensemble.R		original_test_ensemble.R
run-mm-sc2.R		run-mm-sc2.R
run-mm-sc3.R		run-mm-sc3.R
score_sc2.sh		score_sc2.sh
score_sc3.sh		score_sc3.sh

guanhaibin/MM-challenge

Folders and files

Latest commit

History

Repository files navigation

MM-challenge

Introduction

Method:

Resampling:

Feature Selection:

First step Classfiers used to Predict the Group:

Second Step Classifiers used to predict the risk level:

Test Submission Period:

About

Resources

Stars

Watchers

Forks

Languages