Jupyter notebook 
-------

This notebook contains the R code used to compare LASSO regression results with those obtained with the Boruta algorithm in the paper **"Data independent acquisition mass spectrometry in severe Rheumatic Heart Disease (RHD) identifies a proteomic signature showing ongoing inflammation and effectively classifying RHD cases"**.

Author: **Jing Yang**

Date: **17/11/2021**

Contact: Jing.Yang@manchester.ac.uk

In [1]:
library(caret)
library(data.table)
library(tidyverse)
library(glmnet)
library(Boruta)
library(corrplot)
library(DescTools)
library(pROC)

Loading required package: ggplot2

Loading required package: lattice

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ tibble  3.1.5     ✔ dplyr   1.0.7
✔ tidyr   1.1.4     ✔ stringr 1.4.0
✔ readr   2.0.2     ✔ forcats 0.5.1
✔ purrr   0.3.4     

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::between()   masks data.table::between()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::first()     masks data.table::first()
✖ dplyr::lag()       masks stats::lag()
✖ dplyr::last()      masks data.table::last()
✖ purrr::lift()      mas

### load log2 scaled protein expression data

In [2]:
data <- read.csv(file='Data/RHD_data_filtered.csv')
# Replace missing values with 0
data[is.na(data)] <- 0
# Treat the outcome as a factor
data$Group <- as.factor(data$Group)

### split the data into training and testing datasets

In [3]:
set.seed(1)
trainIndex <- createDataPartition(data$Group, p=0.7, list=FALSE)
trainData <- data[trainIndex,] %>% select(-StollerID)
testData <- data[-trainIndex,] %>% select(-StollerID)
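
As a rough illustration of what the stratified split does, the sketch below applies `createDataPartition()` to a simulated two-class outcome and checks that both subsets keep the same class balance. The `toy` data frame, its column `P1`, and the class sizes are made up for this example and are not the RHD data.

```r
library(caret)

# Simulated stand-in for the real data: 100 cases and 100 controls (assumed sizes)
set.seed(1)
toy <- data.frame(
  Group = factor(rep(c("Case", "Control"), each = 100)),
  P1    = rnorm(200)
)

# Sample 70% of the rows within each level of Group
idx       <- createDataPartition(toy$Group, p = 0.7, list = FALSE)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Class proportions are preserved in both subsets
prop.table(table(toy_train$Group))
prop.table(table(toy_test$Group))
```

Because sampling happens within each class, neither subset can end up dominated by one group, which matters when the test set is small.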


### load Boruta results

In [4]:
load(file='Data/Boruta_results_2108.RData')
result_allsample <- attStats(Boruta.allsample) %>%
    filter(decision %in% 'Confirmed') %>%
    mutate(UniProtID=rownames(.)) %>%
    arrange(desc(medianImp))
proteins_confirmed <- result_allsample$UniProtID

In [5]:
fitControl = trainControl(method = "repeatedcv",
                          classProbs = TRUE,
                          number = 10,
                          repeats = 5, 
                          summaryFunction = twoClassSummary,
                          verboseIter = FALSE)
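
The resampling scheme configured above (10 folds, repeated 5 times) evaluates each candidate model on 50 held-out subsets. The sketch below makes this concrete with `createMultiFolds()` from caret on a hypothetical outcome vector `y`; the sample size is an assumption chosen for illustration.

```r
library(caret)

# Hypothetical outcome vector with 60 samples per class
set.seed(1)
y <- factor(rep(c("Case", "Control"), each = 60))

# 10 folds repeated 5 times gives 50 sets of training indices
folds <- createMultiFolds(y, k = 10, times = 5)
length(folds)

# Number of samples held out in each resample (about one tenth of the data)
range(length(y) - lengths(folds))
```

Note that `twoClassSummary` reports ROC, sensitivity and specificity, and it needs class probabilities, which is why `classProbs = TRUE` is set.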

In [6]:
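# Formula restricted to the Boruta-confirmed proteins (disabled, so the model
# below uses every column of trainData other than Group as a predictor):
#boruta.formula <- formula(paste("Group ~ ", paste(proteins_confirmed, collapse = " + ")))
rfBoruta.fit <- train(Group ~ ., 
                      data = trainData,
                      trControl = fitControl,
                      tuneLength = 4,  # tries 4 candidate mtry values; the run shown below selected mtry = 123
                      method = "rf",
                      metric = "ROC")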
print(rfBoruta.fit$finalModel)


Call:
 randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 123

        OOB estimate of  error rate: 15.38%
Confusion matrix:
        Case Control class.error
Case     116      35  0.23178808
Control   13     148  0.08074534
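
The commented-out `boruta.formula` line builds a model formula by pasting protein IDs together. A minimal sketch of the same idea with base R's `reformulate()` is shown below; the IDs in `confirmed` are placeholders invented for this example, not the actual Boruta output.

```r
# Placeholder feature names standing in for `proteins_confirmed`
confirmed <- c("ProtA", "ProtB", "ProtC")

# Builds Group ~ ProtA + ProtB + ProtC without pasting strings by hand
boruta_formula <- reformulate(confirmed, response = "Group")
print(boruta_formula)

# Variables referenced by the formula, response first
all.vars(boruta_formula)
```

Passing such a formula to `train()` in place of `Group ~ .` would restrict the random forest to the selected features.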


In [7]:
# predict() on a randomForest object without newdata returns the out-of-bag predictions
confusionMatrix(predict(rfBoruta.fit$finalModel, type='response'), trainData$Group)

Confusion Matrix and Statistics

          Reference
Prediction Case Control
   Case     116      13
   Control   35     148
                                          
               Accuracy : 0.8462          
                 95% CI : (0.8012, 0.8843)
    No Information Rate : 0.516           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6906          
                                          
 Mcnemar's Test P-Value : 0.002437        
                                          
            Sensitivity : 0.7682          
            Specificity : 0.9193          
         Pos Pred Value : 0.8992          
         Neg Pred Value : 0.8087          
             Prevalence : 0.4840          
         Detection Rate : 0.3718          
   Detection Prevalence : 0.4135          
      Balanced Accuracy : 0.8437          
                                          
       'Positive' Class : Case            
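
The summary statistics above follow directly from the four counts in the table. As a check, the sketch below recomputes a few of them by hand, with the counts copied from the out-of-bag confusion matrix.

```r
# Out-of-bag counts copied from the confusion matrix above
# (rows = predicted class, columns = true class)
cm <- matrix(c(116, 35, 13, 148), nrow = 2,
             dimnames = list(Prediction = c("Case", "Control"),
                             Reference  = c("Case", "Control")))

sensitivity <- cm["Case", "Case"] / sum(cm[, "Case"])           # 116 / 151
specificity <- cm["Control", "Control"] / sum(cm[, "Control"])  # 148 / 161
accuracy    <- sum(diag(cm)) / sum(cm)                          # 264 / 312
balanced    <- (sensitivity + specificity) / 2

round(c(sensitivity, specificity, accuracy, balanced), 4)
# 0.7682 0.9193 0.8462 0.8437
```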
               

### show performance of Boruta results in training and testing data

In [14]:
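# Note: the true labels are passed as `data` and the predictions as `reference`,
# so the "Reference" columns in the output below are the model's predictions.
confusionMatrix(trainData$Group, predict(rfBoruta.fit, newdata = trainData[,1:366], type = "raw"))

confusionMatrix(testData$Group, predict(rfBoruta.fit, newdata = testData[,1:366], type = "raw"))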

Confusion Matrix and Statistics

          Reference
Prediction Case Control
   Case     151       0
   Control    0     161
                                     
               Accuracy : 1          
                 95% CI : (0.9882, 1)
    No Information Rate : 0.516      
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.000      
            Specificity : 1.000      
         Pos Pred Value : 1.000      
         Neg Pred Value : 1.000      
             Prevalence : 0.484      
         Detection Rate : 0.484      
   Detection Prevalence : 0.484      
      Balanced Accuracy : 1.000      
                                     
       'Positive' Class : Case       
                                     

Confusion Matrix and Statistics

          Reference
Prediction Case Control
   Case      50      14
   Control    4      65
                                          
               Accuracy : 0.8647          
                 95% CI : (0.7946, 0.9178)
    No Information Rate : 0.594           
    P-Value [Acc > NIR] : 8.86e-12        
                                          
                  Kappa : 0.7274          
                                          
 Mcnemar's Test P-Value : 0.03389         
                                          
            Sensitivity : 0.9259          
            Specificity : 0.8228          
         Pos Pred Value : 0.7812          
         Neg Pred Value : 0.9420          
             Prevalence : 0.4060          
         Detection Rate : 0.3759          
   Detection Prevalence : 0.4812          
      Balanced Accuracy : 0.8744          
                                          
       'Positive' Class : Case            
               

### LASSO regression

In [15]:
lambdas <- 10^seq(2,-3,by=-0.1)
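
For orientation, the grid defined above is a log-spaced sequence in decreasing order, which is the ordering glmnet expects when a lambda sequence is supplied by the user. A quick look at its shape:

```r
# Same grid as above: exponents from 2 down to -3 in steps of 0.1
lambdas <- 10^seq(2, -3, by = -0.1)

length(lambdas)        # number of candidate penalties
range(lambdas)         # smallest and largest penalty, roughly 0.001 and 100
all(diff(lambdas) < 0) # TRUE when the sequence is strictly decreasing
```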

In [16]:
lasso_trainX <- as.matrix(trainData[,1:366])
lasso_trainy <- trainData$Group
lasso_testX <- as.matrix(testData[,1:366])
lasso_testy <- testData$Group

# Rename the factor levels by position: the first level becomes "1", the second "0"
levels(lasso_trainy) <- c(1,0)
levels(lasso_testy) <- c(1,0)
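
Assigning to `levels()` renames the levels position by position rather than recoding the data. The toy factor `g` below (invented for illustration) shows the effect, assuming the original levels are ordered `Case`, `Control` as in the output above.

```r
# Toy factor with the same level names as Group
g <- factor(c("Case", "Control", "Control", "Case"))
levels(g)            # "Case" "Control"

# The first level is renamed to "1", the second to "0"
levels(g) <- c(1, 0)
levels(g)            # "1" "0"
as.character(g)      # "1" "0" "0" "1"
```

The resulting level order is `"1", "0"`, whereas a factor built from predicted labels with `as.factor()` sorts its levels as `"0", "1"`. This mismatch is one plausible reason for the "Levels are not in the same order" warning printed by `confusionMatrix()` further down.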

In [17]:
# Note: the glmnet argument is spelled `standardize`
lasso_reg <- cv.glmnet(lasso_trainX, lasso_trainy, alpha = 1, family = 'binomial',
                       lambda = lambdas, type.measure = 'deviance',
                       standardize = TRUE, nfolds = 4)

In [18]:
lambda_best <- lasso_reg$lambda.min

In [19]:
lasso_model <- glmnet(lasso_trainX, lasso_trainy, alpha = 1, lambda = lambda_best, family='binomial')
predictions_train <- as.factor(predict(lasso_model, s = lambda_best, newx = lasso_trainX, type = 'class'))
#levels(predictions_train) <- levels(lasso_trainy)

predictions_test <- as.factor(predict(lasso_model, s = lambda_best, newx = lasso_testX, type = 'class'))
#levels(predictions_test) <- levels(lasso_testy)
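
Since LASSO shrinks some coefficients exactly to zero, the fitted model also yields a feature list that could be set against the Boruta-confirmed proteins. The notebook itself does not show this step; the sketch below illustrates one way to extract the selected features, using simulated data in which only `P1` and `P2` carry signal. All names and sizes here are assumptions.

```r
library(glmnet)

# Simulated data: 200 samples, 20 features, signal only in P1 and P2 (assumed)
set.seed(1)
x <- matrix(rnorm(200 * 20), nrow = 200,
            dimnames = list(NULL, paste0("P", 1:20)))
y <- factor(rbinom(200, 1, plogis(2 * x[, "P1"] - 2 * x[, "P2"])))

# Cross-validated LASSO, then coefficients at the deviance-minimising lambda
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 4)
beta   <- as.matrix(coef(cv_fit, s = "lambda.min"))[, 1]

# Features with non-zero coefficients, dropping the intercept
selected <- setdiff(names(beta)[beta != 0], "(Intercept)")
selected

# Overlap with a hypothetical list of confirmed features
intersect(selected, c("P1", "P3"))
```

With the real objects, `intersect(selected, proteins_confirmed)` would give the proteins chosen by both methods.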


### show prediction performance of LASSO classification on training and testing data

In [20]:
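# Note: the true labels are passed as `data` and the predictions as `reference`,
# so the "Reference" columns in the output below are the LASSO predictions.
confusionMatrix(lasso_trainy, predictions_train)
confusionMatrix(lasso_testy, predictions_test)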

“Levels are not in the same order for reference and data. Refactoring data to match.”


Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 158   3
         1  14 137
                                          
               Accuracy : 0.9455          
                 95% CI : (0.9142, 0.9679)
    No Information Rate : 0.5513          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.8907          
                                          
 Mcnemar's Test P-Value : 0.01529         
                                          
            Sensitivity : 0.9186          
            Specificity : 0.9786          
         Pos Pred Value : 0.9814          
         Neg Pred Value : 0.9073          
             Prevalence : 0.5513          
         Detection Rate : 0.5064          
   Detection Prevalence : 0.5160          
      Balanced Accuracy : 0.9486          
                                          
       'Positive' Class : 0               
                              

“Levels are not in the same order for reference and data. Refactoring data to match.”


Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 63  6
         1 13 51
                                          
               Accuracy : 0.8571          
                 95% CI : (0.7859, 0.9117)
    No Information Rate : 0.5714          
    P-Value [Acc > NIR] : 1.23e-12        
                                          
                  Kappa : 0.7127          
                                          
 Mcnemar's Test P-Value : 0.1687          
                                          
            Sensitivity : 0.8289          
            Specificity : 0.8947          
         Pos Pred Value : 0.9130          
         Neg Pred Value : 0.7969          
             Prevalence : 0.5714          
         Detection Rate : 0.4737          
   Detection Prevalence : 0.5188          
      Balanced Accuracy : 0.8618          
                                          
       'Positive' Class : 0               
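
Because `confusionMatrix()` was called with the true labels first and the predictions second, the table above is the transpose of the usual layout. Accuracy is unaffected, but sensitivity and specificity swap roles with the predictive values. The sketch below recomputes the test-set figures with the true labels as the reference, using the counts from the table above and keeping class `0` as the positive class.

```r
# Test-set counts copied from the LASSO table above. The true labels were
# passed first, so rows are the truth and columns are the predictions.
tab <- matrix(c(63, 13, 6, 51), nrow = 2,
              dimnames = list(Truth = c("0", "1"), Predicted = c("0", "1")))

# Conventional definitions, with class "0" as the positive class
sens <- tab["0", "0"] / sum(tab["0", ])   # 63 / 69, printed above as Pos Pred Value
spec <- tab["1", "1"] / sum(tab["1", ])   # 51 / 64, printed above as Neg Pred Value
acc  <- sum(diag(tab)) / sum(tab)         # 114 / 133, unchanged by the transpose

round(c(sensitivity = sens, specificity = spec, accuracy = acc), 4)
# sensitivity specificity    accuracy
#      0.9130      0.7969      0.8571
```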
                                    