In [None]:
library("knitr")
library("glmnet")
library("ggplot2")
library("tidyverse")
library("tidymodels")
library("data.table")

## Lasso-penalised logistic regression using tidymodels

For this illustration, we will use a metabolomics dataset of NMR spectroscopy metabolite abundances in diabetes patients and controls:

In [None]:
mtbsl1 <- fread("../data/MTBSL1.tsv")
nrow(mtbsl1)

In [None]:
head(mtbsl1)

In [None]:
names(mtbsl1)[c(4:ncol(mtbsl1))] <- paste("mtbl",seq(1,ncol(mtbsl1)-3), sep = "_")
print(paste("N. of columns:", ncol(mtbsl1)))

We have **132 records** and **189 variables** (191 columns minus `Primary ID` and `Metabolic_syndrome`): still, $p > n$.
We now look at the distribution of classes (control, diabetes) by sex, which is not dramatically unbalanced:

In [None]:
table(mtbsl1$Metabolic_syndrome,mtbsl1$Gender)

We can see that the metabolite abundances differ widely in scale/magnitude, so it is a good idea to normalise them before running the model. Remember that Lasso constrains the size of the coefficients, and these depend on the magnitude of each variable, while the same $\lambda$ is applied to all variables.
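To make the scale dependence explicit, the Lasso-penalised logistic regression objective can be written as follows. This is a sketch in the parameterisation used by `glmnet`, with $y_i \in \{0, 1\}$; the symbols $\beta_0$, $\beta_j$, $x_{ij}$ and $c$ are notation introduced here.

$$
\min_{\beta_0, \beta} \; -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \left( \beta_0 + x_i^\top \beta \right) - \log\left( 1 + e^{\beta_0 + x_i^\top \beta} \right) \right] + \lambda \sum_{j=1}^{p} |\beta_j|
$$

If metabolite $j$ is rescaled to $c\,x_{ij}$, the same fit to the data is obtained with $\beta_j / c$, so its contribution to the penalty changes from $\lambda |\beta_j|$ to $\lambda |\beta_j| / |c|$. Variables measured on a large scale are therefore penalised less than variables measured on a small scale.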

This is the range of the per-metabolite maximum values (the smallest and the largest maximum across metabolites):

In [None]:
mm_diab <- mtbsl1 %>%
  gather(key = "metabolite", value = "abundance", -c(`Primary ID`,Gender,Metabolic_syndrome))

group_by(mm_diab, metabolite) %>% summarise(max_each=max(abundance)) %>% summarise(min = min(max_each), max = max(max_each))

Below are the boxplots of the distributions of values for all metabolites:

In [None]:
library("repr")
options(repr.plot.width=14, repr.plot.height=8)

mm_diab %>%
  ggplot(aes(metabolite, abundance, fill = as.factor(metabolite))) +
  geom_boxplot(show.legend = FALSE)

### Training and test sets

We now split the data into training and test sets using `tidymodels` functions. The split is stratified, because we want to keep a similar proportion of cases and controls in both sets.

In [None]:
diab_dt <- select(mtbsl1, -c(`Primary ID`, Gender)) ## remove gender for the moment, and keep only numeric features for convenience
mtbsl1_split <- initial_split(diab_dt, strata = Metabolic_syndrome)
mtbsl1_train <- training(mtbsl1_split)
mtbsl1_test <- testing(mtbsl1_split)
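To check what stratification does, one can compare class proportions in the two partitions. The sketch below uses a hypothetical toy data frame (`toy`, with made-up columns `outcome` and `x`) rather than the metabolomics data, and an arbitrary seed:

```r
library(rsample)

set.seed(42)  # arbitrary seed, only to make this sketch reproducible

# hypothetical toy data: 40 cases and 80 controls
toy <- data.frame(
  outcome = factor(rep(c("case", "control"), times = c(40, 80))),
  x = rnorm(120)
)

# stratified split: 75% of each class goes to the training set
toy_split <- initial_split(toy, prop = 0.75, strata = outcome)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# class proportions should be close in the two partitions
prop_train <- prop.table(table(toy_train$outcome))
prop_test  <- prop.table(table(toy_test$outcome))
print(rbind(train = prop_train, test = prop_test))
```

The same check can be applied to `mtbsl1_train` and `mtbsl1_test` using the `Metabolic_syndrome` column.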

### Preprocessing

We remove variables with zero variance (no variability, not informative) and normalise all numeric variables: to do so, we build a preprocessing "recipe", where we specify:

- the model equation: diabetes/control as a function of all metabolites
- remove all predictors (non-outcome variables) that don't have variance
- normalize all numeric predictors (standard deviation of one and a mean of zero)

In [None]:
## build a recipe for preprocessing
mtbsl1_rec <- recipe(Metabolic_syndrome ~ ., data = mtbsl1_train) %>%
  # update_role(`Primary ID`, new_role = "ID") %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes())

print(mtbsl1_rec)
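Conceptually, the two recipe steps learn their parameters (which columns to drop, and the means and standard deviations) from the training data only, and then apply them unchanged to any new data. A minimal base-R sketch of this idea, on hypothetical toy data (the names `train`, `test`, `a`, `b` and `const` are made up for illustration, and this is not how `recipes` is implemented internally):

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible

# hypothetical toy data: two features on very different scales, one constant column
train <- data.frame(a = rnorm(20, 100, 15), b = rnorm(20, 0.01, 0.002), const = 5)
test  <- data.frame(a = rnorm(5, 100, 15),  b = rnorm(5, 0.01, 0.002),  const = 5)

# analogous to step_zv(): keep only columns with non-zero variance in the training set
keep <- names(train)[vapply(train, function(v) var(v) > 0, logical(1))]

# analogous to step_normalize(): means and standard deviations come from the training set
mu    <- colMeans(train[keep])
sigma <- vapply(train[keep], sd, numeric(1))

train_norm <- as.data.frame(scale(train[keep], center = mu, scale = sigma))
test_norm  <- as.data.frame(scale(test[keep],  center = mu, scale = sigma))

# training columns now have mean 0 and standard deviation 1;
# test columns are only approximately so, since they reuse the training parameters
print(round(colMeans(train_norm), 10))
print(vapply(train_norm, sd, numeric(1)))
```

Reusing the training parameters on the test set avoids leaking information from the test data into the preprocessing.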

Now we use the `tidymodels` functions `prep()` and `juice()` to obtain the preprocessed training set:

In [None]:
mtbsl1_prep <- mtbsl1_rec %>%
  prep(strings_as_factors = FALSE)

print(mtbsl1_prep)
mtbsl1_train <- juice(mtbsl1_prep)

In [None]:
mm_train <- mtbsl1_train %>%
  gather(key = "metabolite", value = "abundance", -c(Metabolic_syndrome))

print(group_by(mm_train, metabolite) %>% summarise(mean(abundance),sd(abundance)))

In [None]:
options(repr.plot.width=14, repr.plot.height=8)
mm_train %>%
  ggplot(aes(metabolite, abundance, fill = as.factor(metabolite))) +
  geom_boxplot(show.legend = FALSE)

## The Lasso model

We start by specifying our Lasso model:

- it's logistic regression for a classification problem
- we set the $\lambda$ parameter (penalty) to the arbitrary value of 0.1
- mixture: the proportion of L1 regularization: 1 gives Lasso, 0 gives Ridge regression
- the engine is set to `glmnet`
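In the `glmnet` parameterisation, `penalty` and `mixture` correspond to $\lambda$ and $\alpha$ in the elastic-net penalty (the symbol $\alpha$ is notation introduced here for `mixture`):

$$
\lambda \left[ \alpha \sum_{j=1}^{p} |\beta_j| + \frac{1 - \alpha}{2} \sum_{j=1}^{p} \beta_j^2 \right]
$$

With $\alpha = 1$ only the L1 term remains (Lasso); with $\alpha = 0$ only the squared L2 term remains (Ridge regression).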

Then we add everything to a workflow object, piecewise, and fit the Lasso model:

In [None]:
lasso_spec <- logistic_reg(mode = "classification", penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")
print(lasso_spec)

In [None]:
wf <- workflow() %>%
  add_recipe(mtbsl1_rec) %>%
  add_model(lasso_spec)

In [None]:
lasso_fit <- wf %>%
  fit(data = mtbsl1_train)

In [None]:
lasso_fit %>%
  extract_fit_parsnip() %>% ## replaces the deprecated pull_workflow_fit()
  tidy()

In [None]:
lasso_fit %>%
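  extract_fit_parsnip() %>% ## replaces the deprecated pull_workflow_fit()
  tidy() %>%
  filter(estimate != 0) ## keep all selected variables: Lasso coefficients can also be negative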
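How many coefficients survive depends on the penalty. The following sketch calls `glmnet` directly on simulated data to illustrate this. The data, the seed, the two penalty values and the helper `count_nonzero()` are all assumptions made for illustration.

```r
library(glmnet)

set.seed(123)  # arbitrary seed, only to make this sketch reproducible

# simulated data: 100 samples, 20 predictors, only the first 3 carry signal
n <- 100
p <- 20
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
lin_pred <- 2 * x[, 1] - 2 * x[, 2] + 1.5 * x[, 3]
y <- rbinom(n, 1, plogis(lin_pred))

# Lasso-penalised logistic regression over a whole path of lambda values
fit <- glmnet(x, y, family = "binomial", alpha = 1)

# hypothetical helper: number of non-zero coefficients (intercept excluded) at a given lambda
count_nonzero <- function(fit, lambda) {
  b <- as.matrix(coef(fit, s = lambda))[-1, 1]
  sum(b != 0)
}

n_small <- count_nonzero(fit, 0.01)  # weak penalty: some predictors are kept
n_large <- count_nonzero(fit, 1)     # very strong penalty: every coefficient is shrunk to zero
print(c(weak_penalty = n_small, strong_penalty = n_large))
```

Note that the check is for non-zero rather than positive coefficients, since selected variables can have negative effects.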

## Tuning the hyperparameters

We use k-fold cross-validation to tune the hyperparameters ($\lambda$ penalty in this case) in the training set:

- `vfold_cv`: to specify n. of folds (stratified) and replicates
- `logistic_reg`: to specify that we want a logistic regression model for classification, we want to use Lasso penalization, and we are fine-tuning the penalty parameter
- `grid_regular`: defines the range of penalty parameter values to try
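To get a feel for what these functions produce, here is a small sketch on hypothetical toy data (`toy`, with made-up columns `outcome` and `x`), using a coarser grid than the one used for the real analysis:

```r
library(rsample)
library(dials)

set.seed(7)  # arbitrary seed, only to make this sketch reproducible

# hypothetical toy data with two balanced classes
toy <- data.frame(
  outcome = factor(rep(c("case", "control"), each = 50)),
  x = rnorm(100)
)

# 5 folds repeated 10 times give 50 resamples in total
toy_cv <- vfold_cv(toy, v = 5, repeats = 10, strata = outcome)
print(nrow(toy_cv))

# by default penalty() spans 1e-10 to 1, equally spaced on the log10 scale
toy_grid <- grid_regular(penalty(), levels = 5)
print(toy_grid)

# keep only the candidate values up to 0.05
toy_grid_kept <- toy_grid[toy_grid$penalty <= 0.05, ]
print(toy_grid_kept)
```

Because the grid is regular on the log scale, most candidate values are very small penalties, which is why the results are later plotted with a log-scaled x axis.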

In [None]:
diab_cv <- vfold_cv(mtbsl1_train, v=5, repeats = 10, strata = Metabolic_syndrome)

tune_spec <- logistic_reg(mode = "classification", penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lambda_grid <- grid_regular(penalty(), levels = 50, filter = penalty <= .05)
print(lambda_grid)

In [None]:
wf1 <- workflow() %>%
  add_recipe(mtbsl1_rec) %>%
  add_model(tune_spec) ## remember: the model equation was specified in the recipe (top of this document)

print(wf1)

We are now ready to **fine-tune the model**!!

In [None]:
doParallel::registerDoParallel()

lasso_grid <- tune_grid(
  wf1,
  resamples = diab_cv,
  grid = lambda_grid
)

Here we can see the results for each value of the penalty parameter that was tried in the fine-tuning process:

In [None]:
lasso_grid %>%
  collect_metrics()

Plotting the results will help us see what happened during fine-tuning of the model, and then select the best value for $\lambda$ (the penalty parameter) based on the maximum AUC (binary classification problem):

In [None]:
lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_errorbar(aes(
    ymin = mean - std_err,
    ymax = mean + std_err
  ),
  alpha = 0.5
  ) +
  geom_line(size = 1.5) +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  scale_x_log10() +
  theme(legend.position = "none")

In [None]:
## note: despite its name, `lowest_roc` holds the penalty with the highest mean ROC AUC
lowest_roc <- lasso_grid %>%
  select_best(metric = "roc_auc")

print(lowest_roc)

With the penalty parameter selected by fine-tuning, we can (finally!) finalize our workflow:

In [None]:
final_lasso <- finalize_workflow(
  wf1,
  lowest_roc
)

print(final_lasso)

## Testing the model

We are now ready to test our fine-tuned Lasso model on the test partition:

In [None]:
lr_res <- last_fit(
  final_lasso,
  mtbsl1_split
) 

lr_res %>%
  collect_metrics()


In [None]:
lr_res %>% collect_predictions() %>%
  group_by(.pred_class,Metabolic_syndrome) %>%
  summarise(N=n()) %>%
  spread(key = ".pred_class", value = N)

We can also do it step-by-step:

1. preprocess the test data
2. fit the final Lasso model to the training data
3. make predictions on the test data


In [None]:
mtbsl1_testing <- mtbsl1_prep %>% bake(testing(mtbsl1_split)) ## preprocess test data

final_lasso_fit <- fit(final_lasso, data = mtbsl1_train) ## fit final model on the training set

final_lasso_fit %>%
  predict(mtbsl1_testing, type = "class") %>%
  bind_cols(mtbsl1_testing) %>%
  # metrics(truth = Metabolic_syndrome, estimate = .pred_class)
  group_by(.pred_class,Metabolic_syndrome) %>%
  summarise(N=n()) %>%
  spread(key = ".pred_class", value = N)
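A table of predicted versus observed classes like the one above can be summarised with a few standard metrics. The sketch below uses made-up labels (the vectors `truth` and `pred` are hypothetical, not the predictions from this model) and base R only:

```r
# hypothetical observed and predicted classes for 8 samples
lv <- c("case", "control")
truth <- factor(c("case", "case", "case", "control",
                  "control", "control", "control", "case"), levels = lv)
pred  <- factor(c("case", "case", "control", "control",
                  "control", "case", "control", "case"), levels = lv)

# confusion matrix: rows are observed classes, columns are predicted classes
cm <- table(truth = truth, predicted = pred)
print(cm)

# proportion of all samples classified correctly
accuracy <- sum(diag(cm)) / sum(cm)
# proportion of true cases that were predicted as cases
sensitivity <- cm["case", "case"] / sum(cm["case", ])
# proportion of true controls that were predicted as controls
specificity <- cm["control", "control"] / sum(cm["control", ])

print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

Which class counts as the "positive" one is a convention: here it is assumed to be `case`.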

In [None]:
lr_auc <- 
  lr_res %>% 
  collect_predictions() %>% 
  roc_curve(Metabolic_syndrome,`.pred_Control Group`) %>% 
  mutate(model = "Logistic Regression")

autoplot(lr_auc)
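The area under this curve (the AUC used above to pick the penalty) can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal base-R sketch of this rank-based view, on made-up scores (the function `auc_rank()` and the toy vectors are hypothetical, and this is not how `yardstick` computes it internally):

```r
# hypothetical helper: AUC from ranks (Mann-Whitney formulation)
auc_rank <- function(scores, is_positive) {
  r <- rank(scores)  # average ranks are used for ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# made-up predicted probabilities and true labels for 6 samples
scores <- c(0.9, 0.8, 0.7, 0.3, 0.2, 0.1)
is_pos <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9
auc <- auc_rank(scores, is_pos)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation of the two classes.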

### Variable importance

Lasso models have a nice side feature: they naturally select variables, based on the shrinking of some coefficients exactly to zero.
Based on this, Lasso models can return the importance of the variables used to fit the model:

In [None]:
library("vip")

## the sign of each coefficient is used to colour the variables differently
final_lasso %>%
  fit(mtbsl1_train) %>%
  extract_fit_parsnip() %>% ## replaces the deprecated pull_workflow_fit()
  vi(lambda = lowest_roc$penalty) %>%
  mutate(
    Importance = abs(Importance),
    Variable = fct_reorder(Variable, Importance)
  ) %>%
  filter(Importance > 0) %>%
  ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col() +
  scale_x_continuous(expand = c(0, 0)) +
  labs(y = NULL)

## Add gender (factor) to the Lasso model

To add factors to the Lasso model (engine `glmnet`) we need to use [one hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f):

In [None]:
diab_dt <- model.matrix(~ ., mtbsl1[,"Gender"]) %>%
  as_tibble() %>%
  select(GenderMale) %>%
  rename(gender = GenderMale) %>%
  bind_cols(diab_dt)

head(diab_dt)
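For a two-level factor, `model.matrix()` with an intercept produces a single 0/1 indicator column (the reference level is absorbed by the intercept), while dropping the intercept gives one indicator column per level. A small base-R sketch on a hypothetical toy data frame (`toy`, with made-up values):

```r
# hypothetical toy data with a two-level factor
toy <- data.frame(Gender = factor(c("Female", "Male", "Male", "Female")))

# with intercept: one indicator column, "Female" is the reference level
mm_dummy <- model.matrix(~ Gender, toy)
print(colnames(mm_dummy))

# without intercept: one indicator column per level (full one-hot encoding)
mm_onehot <- model.matrix(~ Gender - 1, toy)
print(colnames(mm_onehot))
print(mm_onehot)
```

With only two levels, the single indicator column carries the same information as the two one-hot columns, which is why keeping only `GenderMale` is enough here.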

#### Interactive exercise

We'll now rerun a Lasso-penalised logistic regression model including Gender as a feature, going together through the steps involved in Lasso models with `tidymodels`:

1. splitting the data
2. preprocessing
3. hyperparameters tuning
4. fitting the final model
5. predictions on the test set

In [None]:
## code here


In [None]:
## code here