## Boosting: the weak learning perspective
### A basic boosting model

In [8]:
library("gbm")
library("vip")
library("caret")
library("xgboost")
library("tidymodels")
library("data.table")
library("randomForest")

We start by reading in the diabetes data and splitting them into training and test sets. As before, this is a **multiclass classification problem**: the four classes are the combinations of gender and health status.

In [9]:
## read the data
mtbsl1 <- fread("../data/MTBSL1.tsv")
names(mtbsl1)[c(4:ncol(mtbsl1))] <- paste("mtbl",seq(1,ncol(mtbsl1)-3), sep = "_")
mtbsl1$gender_status <- paste(mtbsl1$Gender,mtbsl1$Metabolic_syndrome,sep="_")
diab_dt <- select(mtbsl1, -c(`Primary ID`, Gender, Metabolic_syndrome))

In [10]:
## DATA SPLITTING

diab_dt$id <- paste("id",seq(1,nrow(diab_dt)), sep="_")

training_set <- diab_dt %>%
  group_by(gender_status) %>%
  sample_frac(size = 0.7)

test_recs <- !(diab_dt$id %in% training_set$id)
test_set <- diab_dt[test_recs,]

training_set$id <- NULL
test_set$id <- NULL

table(training_set$gender_status)
table(test_set$gender_status)


    Female_Control Group Female_diabetes mellitus       Male_Control Group 
                      20                       18                       39 
  Male_diabetes mellitus 
                      15 


    Female_Control Group Female_diabetes mellitus       Male_Control Group 
                       8                        8                       17 
  Male_diabetes mellitus 
                       7 

We now use the `gbm` function from the *gbm* package:

- formula: `gender_status` as a function of all metabolites
- distribution: **multinomial** (4 classes)
- n.trees: total number of trees, i.e. the number of sequential models that are added together ($B$)
- shrinkage: the shrinkage parameter $\lambda$ (learning rate, or step size)
- interaction.depth: maximum depth of each tree ($d$)
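
To make the roles of `n.trees`, `shrinkage` and `interaction.depth` more concrete, here is a minimal from-scratch sketch of boosting for a regression problem with squared-error loss. It is not what `gbm` does internally for the multinomial case: the simulated data, the use of `rpart` as weak learner and the names `B`, `lambda` and `d` are assumptions made for illustration only.

```r
library(rpart)  # recommended package shipped with R, used here as the weak learner

set.seed(42)
n <- 200
sim <- data.frame(x = runif(n, -3, 3))
sim$y <- sin(sim$x) + rnorm(n, sd = 0.2)   # simulated data (assumption)

B <- 200        # number of trees, plays the role of n.trees
lambda <- 0.1   # shrinkage (learning rate)
d <- 1          # tree depth, roughly analogous to interaction.depth

f_hat <- rep(0, n)        # current ensemble prediction, starts at zero
res <- sim$y              # residuals of the current ensemble
train_mse <- numeric(B)   # training error after each added tree

for (b in seq_len(B)) {
  ## fit a small tree to the current residuals
  tree <- rpart(res ~ x, data = data.frame(x = sim$x, res = res),
                control = rpart.control(maxdepth = d, cp = 0))
  step <- lambda * predict(tree)   # shrunken contribution of tree b
  f_hat <- f_hat + step            # add it to the ensemble
  res <- res - step                # update the residuals
  train_mse[b] <- mean((sim$y - f_hat)^2)
}

print(round(train_mse[c(1, 10, 50, B)], 4))
```

Each tree is weak on its own (a single split when `d = 1`); the ensemble improves because every new tree is fitted to what the previous ones left unexplained. A smaller `lambda` makes each step more cautious and typically calls for a larger `B`.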

In [11]:
boost.diabt = gbm(
  gender_status ~ ., 
  data=training_set, 
  distribution="multinomial",
  n.trees=1000, ## B parameter
  shrinkage=0.01, ## (learning rate, or step-size)
  interaction.depth=2 ## d parameter 
)

print(boost.diabt)

“Setting `distribution = "multinomial"` is ill-advised as it is currently broken. It exists only for backwards compatibility. Use at your own risk.”


gbm(formula = gender_status ~ ., distribution = "multinomial", 
    data = training_set, n.trees = 1000, interaction.depth = 2, 
    shrinkage = 0.01)
A gradient boosted model with multinomial loss function.
1000 iterations were performed.
There were 188 predictors of which 179 had non-zero influence.


In [12]:
preds <- predict.gbm(object = boost.diabt,
                     newdata = test_set,
                     n.trees = 1000,
                     type = "response")
print(preds)

, , 1000

      Female_Control Group Female_diabetes mellitus Male_Control Group
 [1,]         1.450209e-01             7.549297e-01       0.0456047197
 [2,]         2.727957e-03             9.886126e-01       0.0015527770
 [3,]         6.916201e-03             9.894049e-01       0.0013639838
 [4,]         5.449296e-01             3.633064e-01       0.0829648866
 [5,]         5.435920e-03             6.311247e-02       0.8661920583
 [6,]         1.197856e-02             6.896999e-01       0.1668867565
 [7,]         1.086088e-02             4.704908e-01       0.0066506445
 [8,]         2.094938e-03             9.930754e-01       0.0002561349
 [9,]         1.767297e-02             5.678751e-01       0.2082919446
[10,]         1.293237e-04             2.974186e-03       0.0008998135
[11,]         7.575705e-03             3.930274e-02       0.2616813346
[12,]         9.495537e-03             1.513046e-01       0.8243262657
[13,]         5.707878e-02             9.371547e-01       0.0008162

In [13]:
labels <- colnames(preds)[apply(preds, 1, which.max)]
result <- data.frame(test_set$gender_status, labels)
result$res <- result$test_set.gender_status == result$labels
print(result)

     test_set.gender_status                   labels   res
1    Male_diabetes mellitus Female_diabetes mellitus FALSE
2  Female_diabetes mellitus Female_diabetes mellitus  TRUE
3  Female_diabetes mellitus Female_diabetes mellitus  TRUE
4  Female_diabetes mellitus     Female_Control Group FALSE
5    Male_diabetes mellitus       Male_Control Group FALSE
6    Male_diabetes mellitus Female_diabetes mellitus FALSE
7    Male_diabetes mellitus   Male_diabetes mellitus  TRUE
8  Female_diabetes mellitus Female_diabetes mellitus  TRUE
9  Female_diabetes mellitus Female_diabetes mellitus  TRUE
10   Male_diabetes mellitus   Male_diabetes mellitus  TRUE
11   Male_diabetes mellitus   Male_diabetes mellitus  TRUE
12   Male_diabetes mellitus       Male_Control Group FALSE
13 Female_diabetes mellitus Female_diabetes mellitus  TRUE
14 Female_diabetes mellitus   Male_diabetes mellitus FALSE
15 Female_diabetes mellitus Female_diabetes mellitus  TRUE
16       Male_Control Group       Male_Control Group  TR

In [14]:
accuracy = sum(result$res)/nrow(result)
print(accuracy)

[1] 0.8


In [15]:
result %>%
  mutate(test_set.gender_status = factor(test_set.gender_status),
         pred.labels = factor(labels)) %>%
  conf_mat(test_set.gender_status,pred.labels)

                          Truth
Prediction                 Female_Control Group Female_diabetes mellitus
  Female_Control Group                        6                        1
  Female_diabetes mellitus                    0                        6
  Male_Control Group                          1                        0
  Male_diabetes mellitus                      1                        1
                          Truth
Prediction                 Male_Control Group Male_diabetes mellitus
  Female_Control Group                      0                      0
  Female_diabetes mellitus                  0                      2
  Male_Control Group                       17                      2
  Male_diabetes mellitus                    0                      3
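
Overall accuracy can hide large differences between classes. As a small sketch, the counts of the confusion matrix above are typed in by hand below (the object names `cm_counts`, `overall_accuracy` and `recall_per_class` are illustrative) and used to compute the accuracy together with the per-class recall, i.e. the fraction of each true class that was predicted correctly.

```r
## counts copied from the confusion matrix printed above
## (rows = prediction, columns = truth)
classes <- c("Female_Control Group", "Female_diabetes mellitus",
             "Male_Control Group", "Male_diabetes mellitus")

cm_counts <- matrix(
  c(6, 1,  0, 0,
    0, 6,  0, 2,
    1, 0, 17, 2,
    1, 1,  0, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = classes, Truth = classes)
)

## correct predictions sit on the diagonal
overall_accuracy <- sum(diag(cm_counts)) / sum(cm_counts)

## recall: correct predictions divided by the number of true cases per class
recall_per_class <- diag(cm_counts) / colSums(cm_counts)

print(overall_accuracy)
print(round(recall_per_class, 3))
```

With these counts the overall accuracy is 0.8, while recall ranges from 1 for male controls down to 3/7 for male diabetes cases.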

## Tuning a boosting model

We now use `tidymodels` to build a recipe and workflow to tune our boosting model:

1. split the data into training and test sets
2. specify the preprocessing recipe (remove highly correlated variables, remove zero-variance variables, normalize variables, impute missing data)
3. partition the training set into k folds for cross-validation, used to tune the hyperparameters
4. specify the boosting model:
    - "classification" mode
    - n. of trees (sequential models to combine)
    - min. n. of obs per node $\rightarrow$ tuning parameter
    - tree depth $\rightarrow$ tuning parameter
    - shrinkage parameter (learning rate) $\rightarrow$ tuning parameter
5. define the grid (combinations) of hyperparameters to test
6. put everything in a workflow
7. run the fine-tuning of hyperparameters
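
In the cells that follow, the workflow is built with `add_formula()`, so `preprocessing_recipe` is not applied during tuning. One way to have the preprocessing estimated every time the workflow is fitted is to pass an unprepped recipe to `add_recipe()`. The sketch below illustrates this on the built-in `iris` data; the dataset, the fixed hyperparameter values and the `demo_*` names are assumptions for illustration, not part of the analysis in this notebook.

```r
library(tidymodels)

## iris is a stand-in dataset (assumption); the recipe is left unprepped on purpose
demo_recipe <-
  recipe(Species ~ ., data = iris) %>%
  step_zv(all_predictors()) %>%                ## drop zero-variance predictors
  step_normalize(all_numeric_predictors())     ## center and scale

## a small boosted model with fixed (not tuned) hyperparameters
demo_model <-
  boost_tree(mode = "classification", trees = 20,
             tree_depth = 2, learn_rate = 0.1) %>%
  set_engine("xgboost")

## recipe and model bundled together: the preprocessing is estimated
## on whatever data the workflow is fitted to
demo_wf <-
  workflow() %>%
  add_recipe(demo_recipe) %>%
  add_model(demo_model)

demo_fit <- fit(demo_wf, data = iris)
demo_preds <- predict(demo_fit, new_data = iris)
print(head(demo_preds))
```

When such a workflow is passed to `tune_grid()`, the recipe is re-estimated within each resample, which avoids leaking information from the assessment folds into the preprocessing.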

In [16]:
## data splitting
diab_dt <- select(mtbsl1, -c(`Primary ID`, Gender, Metabolic_syndrome))
diab_dt$gender_status <- factor(diab_dt$gender_status)
mtbsl1_split <- initial_split(diab_dt, strata = gender_status, prop = 0.7)
mtbsl1_train <- training(mtbsl1_split)
mtbsl1_test <- testing(mtbsl1_split)

In [17]:
## preprocessing
preprocessing_recipe <-
  recipes::recipe(gender_status ~ ., data = mtbsl1_train) %>%
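  step_corr(all_predictors(), threshold = 0.9) %>% ## remove highly correlated variables (|r| > 0.9)
  step_zv(all_numeric(), -all_outcomes()) %>% ## remove zero-variance variables
  step_normalize(all_numeric(), -all_outcomes()) %>% ## center and scale
  step_impute_knn(all_numeric(), neighbors = 5) %>% ## impute missing values from the 5 nearest neighbours
  prep() ## estimate the preprocessing parameters on the training set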

In [18]:
## k-fold cross-validation for tuning
diab_cv <- vfold_cv(mtbsl1_train, v=5, repeats = 5, strata = gender_status)

In [19]:
# XGBoost model specification
xgboost_model <- 
  boost_tree(
    mode = "classification",
    trees = 100, ## B parameter
    min_n = tune(),
    tree_depth = tune(), ## d parameter
    learn_rate = tune() 
  ) %>%
  set_engine("xgboost", objective = "multi:softprob", num_class = 4, lambda=0, alpha=1, verbose=0)

In [20]:
# grid specification
xgboost_params <- 
  parameters(
    min_n(),
    tree_depth(),
    learn_rate()
  )

xgboost_grid <- 
  grid_max_entropy(
    xgboost_params, 
    size = 15
  )

print(xgboost_grid)

# A tibble: 15 × 3
   min_n tree_depth learn_rate
   <int>      <int>      <dbl>
 1    25         14   1.15e- 3
 2    28          1   1.83e- 3
 3     3          7   2.07e- 8
 4    35          9   8.92e- 7
 5     3          4   9.52e- 4
 6    39         15   8.05e- 7
 7    12         10   1.47e- 5
 8    25          5   1.10e-10
 9    36          9   9.66e- 2
10    35          2   6.78e- 9
11    17          5   1.74e- 6
12    14         10   6.10e-10
13     4         14   8.50e- 8
14    11          1   3.17e-10

In [21]:
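## workflow
## note: only the model and the formula are added here; preprocessing_recipe is not
## part of this workflow, so tuning runs on the unprocessed predictors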
xgboost_wf <- 
  workflows::workflow() %>%
  add_model(xgboost_model) %>% 
  add_formula(gender_status ~ .)

In [22]:
# hyperparameter tuning
xgboost_tuned <- tune_grid(
  object = xgboost_wf,
  resamples = diab_cv,
  grid = xgboost_grid,
  # metrics = yardstick::metric_set(accuracy, roc_auc), ## classification metrics (rmse, rsq, mae apply to regression only)
  control = control_grid(verbose = FALSE)
)

In [23]:
## explore tuning results
collect_metrics(xgboost_tuned)

min_n tree_depth   learn_rate .metric  .estimator      mean     n     std_err .config
<int>      <int>        <dbl> <chr>    <chr>          <dbl> <int>       <dbl> <chr>
   25         14   0.00114876 accuracy multiclass 0.4290205    25 0.002041774 Preprocessor1_Model01
   25         14   0.00114876 roc_auc  hand_till        0.5    25         0.0 Preprocessor1_Model01
   28          1  0.001832832 accuracy multiclass 0.2582602    25 0.018316935 Preprocessor1_Model02
   28          1  0.001832832 roc_auc  hand_till        0.5    25         0.0 Preprocessor1_Model02
    3          7 2.073682e-08 accuracy multiclass 0.5122953    25 0.017415771 Preprocessor1_Model03
    3          7 2.073682e-08 roc_auc  hand_till  0.7019709    25 0.013652924 Preprocessor1_Model03
   35          9 8.923622e-07 accuracy multiclass 0.2082602    25 0.002311799 Preprocessor1_Model04
   35          9 8.923622e-07 roc_auc  hand_till        0.5    25         0.0 Preprocessor1_Model04
    3          4 0.0009520798 accuracy multiclass 0.6437281    25 0.019274687 Preprocessor1_Model05
    3          4 0.0009520798 roc_auc  hand_till  0.7820271    25 0.013864367 Preprocessor1_Model05


In [24]:
library("repr")
options(repr.plot.width=14, repr.plot.height=8)

## accuracy of each candidate model against the value of each tuned hyperparameter
xgboost_tuned %>%
  collect_metrics() %>%
  filter(.metric == "accuracy") %>%
  select(mean, min_n:learn_rate) %>%
  pivot_longer(min_n:learn_rate,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "accuracy")


### Select and evaluate the best model

We first show the best models in terms of ROC AUC. Then:

- we select the model with the highest accuracy
- we add the best model to the workflow $\rightarrow$ final workflow
- fit the final model on the data split (fit on training data, evaluate on test data)
- collect results and look at key metrics
- calculate the accuracy of predictions (confusion matrix)
- finally, extract variable importance
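
Ranking by ROC AUC and selecting by accuracy need not point to the same configuration. The base-R sketch below mimics the ranking on the five configurations visible in the `collect_metrics()` output above; it does not use the `tune` functions, the values are copied by hand from that output, and the object names are illustrative. Since the tuning grid has 15 candidates, the result is only indicative.

```r
## the five configurations printed by collect_metrics() above
## (mean accuracy and mean ROC AUC copied from that output)
tuning_res <- data.frame(
  .config  = paste0("Preprocessor1_Model0", 1:5),
  accuracy = c(0.4290205, 0.2582602, 0.5122953, 0.2082602, 0.6437281),
  roc_auc  = c(0.5, 0.5, 0.7019709, 0.5, 0.7820271)
)

## rank the configurations by each metric, best first
by_accuracy <- tuning_res[order(-tuning_res$accuracy), ]
by_roc_auc  <- tuning_res[order(-tuning_res$roc_auc), ]

print(by_accuracy$.config[1])
print(by_roc_auc$.config[1])
```

For these five rows both criteria select the same configuration (`Preprocessor1_Model05`); on the full tuning results the two rankings should be compared before settling on one metric.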

In [None]:
xgboost_tuned %>%
  show_best(metric = "roc_auc")

In [None]:
xgboost_best_params <- xgboost_tuned %>%
  select_best(metric = "accuracy") ## name the argument: recent versions of tune require it

print(xgboost_best_params)

In [None]:
final_xgb <- finalize_workflow(
  xgboost_wf,
  xgboost_best_params
)

In [None]:
final_res <- last_fit(final_xgb, mtbsl1_split)
collect_metrics(final_res)

collect_predictions(final_res) %>%
  metrics(gender_status, .pred_class)


In [None]:
cm <- collect_predictions(final_res) %>%
  conf_mat(gender_status, .pred_class)

print(cm)

In [None]:
autoplot(cm, type="heatmap")

In [None]:
library("vip")

final_xgb %>%
  fit(data = juice(preprocessing_recipe)) %>%
  extract_fit_parsnip() %>% ## replaces the deprecated pull_workflow_fit()
  vip(geom = "point")