## Introduction
This notebook provides an interactive environment for training a predictive model to estimate patient length of stay (LOS). This workflow is particularly suited for healthcare professionals and researchers interested in ICU resource planning and patient flow management.

You only need to edit the code blocks marked with the following comment:

In [None]:
### ‼️User Action Required

All other blocks should run without modification.
Warnings and notices are preceded by ⚠️.

### Load and Validate User Dataset

1. Set `data_path` to the **path to the data** you want to train the model on. The file must be in `.RData` format. Set `object_name` to the **object name** of the dataset stored in that file.

2. To use your **own predictors**, set `predictors` to the path of a CSV file that lists their names. The loading code reads the names from the second column of that file. Otherwise, keep the default path, and the list of predictor variables used during the original model training is loaded automatically.

3. Set `user_model_name` to the **name** under which the model should be saved. The model is saved as an `.RData` file.

⚠️ Your dataset must include every predictor listed in the predictors file (the default `predictors.csv` if you are not using your own), as well as the target variable `UnitLengthStay_trunc`.
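
The loading code in this notebook takes the predictor names from the second column of the predictors file. The sketch below shows one way to produce a file in that layout. It assumes, without confirmation from the notebook, that the default `predictors.csv` was written with `write.csv()` and its default row names, which would explain why the names sit in column 2. The predictor names and the toy dataset are hypothetical.

```r
# Hypothetical predictor names, for illustration only
my_predictors <- data.frame(
  predictor = c("age", "heart_rate", "admission_type")
)

# write.csv() adds a row-name column by default, so the names land in column 2
csv_path <- tempfile(fileext = ".csv")
write.csv(my_predictors, csv_path)

# Read the file back the same way the notebook does
loaded <- read.csv(csv_path)
predictor_names <- as.character(loaded[, 2])
print(predictor_names)

# Check a toy dataset against the list before training
toy_data <- data.frame(age = c(64, 71), heart_rate = c(88, 102))
missing_predictors <- setdiff(predictor_names, names(toy_data))
print(missing_predictors)
```

Running a check like this before the training cells makes it easier to see which columns are missing, rather than only learning that some are.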

In [None]:
if (!require("caret")) {install.packages("caret", dependencies = TRUE) ; library(caret)}
if (!require("caretEnsemble")) {install.packages("caretEnsemble", dependencies = TRUE) ; library(caretEnsemble)}
if (!require("tidyverse")) {install.packages("tidyverse") ; library(tidyverse)}
if (!require("MLmetrics")) {install.packages("MLmetrics") ; library(MLmetrics)}
if (!require("ranger")) {install.packages("ranger"); library(ranger) }
if (!require("DescTools")) {install.packages("DescTools"); library(DescTools) }
if (!require("mice")) {install.packages("mice"); library(mice) }

In [None]:
### ‼️User Action Required

data_path = "C:\\Users\\joana\\Documentos\\SLOS\\SLOS retraining\\SampledData.RData"
object_name = "sampled_data"
user_model_name = "YOUR_MODEL_NAME"
predictors = "C:\\Users\\joana\\Documentos\\SLOS\\SLOS retraining\\predictors.csv" # default predictors

In [None]:
if (grepl("\\.RData$", data_path, ignore.case = TRUE)) {
  load(data_path)
  if (!exists(object_name)) {
    stop(paste("The .RData file does not contain an object named", object_name))
  }
  user_data <- get(object_name)
} else {
  stop("Unsupported file type. Please upload an .RData file.")
}

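# Read the predictor names; they are expected in the second column of the CSV
predictors <- read.csv(predictors)
predictors <- as.character(predictors[, 2])

# Report exactly which predictors are missing from the dataset
missing_predictors <- setdiff(predictors, names(user_data))
if (length(missing_predictors) > 0) {
  stop("Some required predictors are missing in your dataset: ",
       paste(missing_predictors, collapse = ", "))
}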

if (!"UnitLengthStay_trunc" %in% names(user_data)) {
    stop("The target variable UnitLengthStay_trunc is missing in your dataset.")
}

user_data <- user_data %>%
  select(any_of(predictors), UnitLengthStay_trunc)

### Data Pre-processing
Remove zero and near-zero variance features and correlated predictors (numeric and categorical), then impute missing data using the MICE algorithm.
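
To make the two filters more concrete, here is a minimal base R sketch of the criteria involved. The notebook itself relies on `nearZeroVar()` from `caret` (with `freqCut = 100/2`) and `CramerV()` from `DescTools`. The functions `nzv_sketch()` and `cramers_v_sketch()` below are hypothetical, simplified re-implementations for illustration, not the packages' code, and the toy vectors are invented.

```r
# Near-zero variance check for one column (simplified re-implementation)
nzv_sketch <- function(x, freq_cut = 100 / 2, unique_cut = 10) {
  x <- x[!is.na(x)]
  counts <- sort(table(x), decreasing = TRUE)
  n_unique <- length(counts)
  # Ratio of the most common value's frequency to the second most common
  freq_ratio <- if (n_unique <= 1) 0 else as.numeric(counts[1] / counts[2])
  percent_unique <- 100 * n_unique / length(x)
  zero_var <- n_unique <= 1
  list(
    freq_ratio = freq_ratio,
    percent_unique = percent_unique,
    flagged = zero_var || (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

# Cramér's V for two categorical vectors (no continuity correction)
cramers_v_sketch <- function(a, b) {
  tab <- table(a, b)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  as.numeric(sqrt(chi2 / (n * (min(dim(tab)) - 1))))
}

# Toy data: 99 zeros and a single one exceed the frequency ratio cut of 50
rare_flag <- c(rep(0, 99), 1)
print(nzv_sketch(rare_flag)$flagged)

# A balanced column is not flagged
balanced <- rep(c(0, 1), 50)
print(nzv_sketch(balanced)$flagged)

# Two perfectly associated factors give a Cramér's V of 1
ward <- factor(c("a", "a", "b", "b", "a", "b"))
unit <- factor(c("x", "x", "y", "y", "x", "y"))
print(cramers_v_sketch(ward, unit))
```

In the notebook, categorical predictors whose pairwise Cramér's V exceeds 0.5 and numeric predictors whose pairwise correlation exceeds 0.75 are candidates for removal.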

In [None]:
# Reproducible 80/20 split on the target variable
set.seed(998)
inTraining <- createDataPartition(user_data$UnitLengthStay_trunc,
                                  p = .8, list = FALSE)
training <- user_data[ inTraining,]
training_dummy <- training   # untouched copy, used later to restore the target
testing  <- user_data[-inTraining,]
testing_dummy <- testing

# Identify and remove zero and near-zero variance features
nzv = nearZeroVar(training, saveMetrics = TRUE, freqCut = 100/2)
nzv["Variables"] = row.names(nzv)
desc_nzv = nzv %>%
  filter(nzv == TRUE) %>%
  select(Variables, freqRatio, percentUnique)
# Never drop the target, even if it is flagged
removed_nzv = setdiff(desc_nzv$Variables, "UnitLengthStay_trunc")

training = training %>%
  select(-all_of(removed_nzv))
testing = testing %>%
  select(-all_of(removed_nzv))

# Identify and remove correlated predictors (numeric features)
training_pre_numeric = training %>%
  select_if(is.numeric)
training_pre_numeric$UnitLengthStay_trunc = NULL   # keep the target out of the filter
descrCor <- cor(training_pre_numeric,
                use = "pairwise.complete.obs")

highlyCorDescr <- findCorrelation(descrCor, cutoff = .75)
removed_cor = colnames(training_pre_numeric)[highlyCorDescr]
# Select by name: negative indexing would drop every column if nothing is flagged
training_pre_numeric = training_pre_numeric %>%
  select(-all_of(removed_cor))

testing_pre_numeric = testing %>%
  select_if(is.numeric)
testing_pre_numeric$UnitLengthStay_trunc = NULL
testing_pre_numeric = testing_pre_numeric %>%
  select(-all_of(removed_cor))

# Identify and remove correlated predictors (categorical features, Cramér's V)
training_pre_factor = training %>%
  select_if(is.factor)
cramer_tab = PairApply(training_pre_factor,
                       CramerV, symmetric = TRUE)
cramer_tab[is.na(cramer_tab)] = 0

highlyCorCateg <- findCorrelation(cramer_tab, cutoff = 0.5)
removed_categ = colnames(training_pre_factor)[highlyCorCateg]
training_pre_factor = training_pre_factor %>%
  select(-all_of(removed_categ))

testing_pre_factor = testing %>%
  select_if(is.factor)
testing_pre_factor = testing_pre_factor %>%
  select(-all_of(removed_categ))

# Recombine the filtered predictors, with the target as the last column
training = cbind(training_pre_numeric, training_pre_factor,
                 UnitLengthStay_trunc = training$UnitLengthStay_trunc)
testing = cbind(testing_pre_numeric, testing_pre_factor,
                UnitLengthStay_trunc = testing$UnitLengthStay_trunc)

# MICE imputation
training_imp = training
testing_imp = testing

# Training set
set.seed(100)
predictormatrix = quickpred(training_imp,
                            include = c("UnitLengthStay_trunc"),
                            exclude = NULL,
                            mincor = 0.3)
imp_gen = mice(data = training_imp,
               predictorMatrix = predictormatrix,
               m = 1,
               maxit = 5,
               diagnostics = TRUE)

imp_data = mice::complete(imp_gen, 1)
training_imp = imp_data
summary(training_imp)
# Restore the original (non-imputed) target values
training_imp$UnitLengthStay_trunc <- training_dummy$UnitLengthStay_trunc
training <- training_imp

# Testing set (imputed separately from the training set)
set.seed(100)
predictormatrix = quickpred(testing_imp,
                            include = c("UnitLengthStay_trunc"),
                            exclude = NULL,
                            mincor = 0.3)
imp_gen_test = mice(data = testing_imp,
                    predictorMatrix = predictormatrix,
                    m = 1,
                    maxit = 5,
                    diagnostics = TRUE)
imp_data_test = mice::complete(imp_gen_test, 1)
testing_imp = imp_data_test
summary(testing_imp)
# Restore the original (non-imputed) target values
testing_imp$UnitLengthStay_trunc <- testing_dummy$UnitLengthStay_trunc
testing <- testing_imp

### Model Training
This section covers the training pipeline, from setting up cross-validation to building an ensemble model with `caretStack()`.

**Steps**:
1. **Train-Test Split**

- The reproducible 80/20 split was already created with `createDataPartition()` during pre-processing. The models below are trained on the `training` set.

2. **Cross-Validation Setup**

- `trainControl()` defines a 5-fold cross-validation strategy with progress output (`verboseIter = TRUE`) and retains the predictions of the final model (`savePredictions = "final"`).

3. **Model Training with caretList**

- Two base models are trained using `caretList()`: linear regression (`lm`) and random forest (`ranger`), the latter with a tuning grid over `mtry`, `splitrule`, and `min.node.size`. These models are stored in `model_list`.

4. **Model Stacking with caretStack**

- A stacked ensemble model is built from the base learners using `caretStack()`. A secondary random forest model (with its own tuning grid) combines their predictions. The final stacked model is saved to an `.RData` file under the name given in `user_model_name`.

The saved stacked model can then be reloaded for reuse or deployment.
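
The general idea behind stacking can be sketched in base R: each base learner produces cross-validated (out-of-fold) predictions, and a meta-learner is then trained on those predictions. The example below is a simplified illustration on invented data, not `caretEnsemble`'s implementation. It uses two `lm()` fits as base learners and another `lm()` as the meta-learner, whereas the notebook uses `lm` and `ranger` as base learners and `ranger` as the meta-learner.

```r
set.seed(1)

# Toy regression data (hypothetical variables)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 - toy$x2^2 + rnorm(n, sd = 0.5)

# Assign each row to one of 5 folds
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# Collect out-of-fold predictions from two simple base learners
oof <- data.frame(linear = rep(NA_real_, n), quadratic = rep(NA_real_, n))
for (fold in seq_len(k)) {
  held_out <- fold_id == fold
  fit_linear <- lm(y ~ x1 + x2, data = toy[!held_out, ])
  fit_quadratic <- lm(y ~ x1 + I(x2^2), data = toy[!held_out, ])
  oof$linear[held_out] <- predict(fit_linear, newdata = toy[held_out, ])
  oof$quadratic[held_out] <- predict(fit_quadratic, newdata = toy[held_out, ])
}

# The meta-learner is trained on the out-of-fold predictions only
meta_data <- cbind(oof, y = toy$y)
meta_fit <- lm(y ~ linear + quadratic, data = meta_data)
print(coef(meta_fit))
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for rows the base learners have already seen, is what `savePredictions = "final"` makes possible in the notebook's `trainControl()` setup.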



In [None]:
fitControl <- trainControl(
  method = "cv", 
  number = 5, 
  verboseIter = TRUE, 
  returnData = FALSE,
  trim = TRUE,
  savePredictions = "final"
)

In [None]:
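# Exclude the target (and any `UnitLengthStay` copy of it left over from
# pre-processing) so the outcome does not leak into the predictors
predictor_cols <- setdiff(names(training),
                          c("UnitLengthStay_trunc", "UnitLengthStay"))

model_list <- caretList(
  x = training[, predictor_cols],
  y = training$UnitLengthStay_trunc,
  trControl = fitControl,
  metric = "RMSE",
  tuneList = list(
    lm = caretModelSpec(method = "lm"),
    # mtry values of 5 to 10 require at least 10 predictors
    rf = caretModelSpec(method = "ranger", tuneGrid = data.frame(
      .mtry = c(5:10),
      .splitrule = "variance",
      .min.node.size = 5
    ))
  )
)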

In [None]:
rfGrid <- expand.grid(
  mtry = 2,
  min.node.size = c(5,10,15,20),
  splitrule = c("variance", "extratrees", "maxstat")
)

stacked_model <- caretStack(
  model_list,
  trControl = fitControl,
  metric = "RMSE",
  method = "ranger",
  tuneGrid = rfGrid
)

In [None]:
substrRight <- function(x, n){
  substr(x, nchar(x) - n + 1, nchar(x))
}

if (substrRight(user_model_name, 6) != ".RData") {
  user_model_name <- paste0(user_model_name, ".RData")
}

save(stacked_model, file = user_model_name)
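
A saved model can be brought back into a later session with `load()`. The sketch below illustrates this with a toy `lm()` fit standing in for the stacked model. The temporary file path and the environment name `model_env` are assumptions made for the example.

```r
# Toy stand-in for the stacked model (the real object comes from caretStack())
stacked_model <- lm(dist ~ speed, data = cars)

model_file <- tempfile(fileext = ".RData")
save(stacked_model, file = model_file)

# Load into a separate environment so existing objects are not overwritten
model_env <- new.env()
loaded_names <- load(model_file, envir = model_env)
print(loaded_names)

reloaded_model <- model_env$stacked_model
print(predict(reloaded_model, newdata = data.frame(speed = 10)))
```

Because `load()` restores objects under the names they were saved with, the reloaded object is called `stacked_model` regardless of the file name chosen in `user_model_name`.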

### Model Prediction and Evaluation
This section handles making predictions with the trained stacked model and evaluating its performance using key metrics.

1. The `predict(stacked_model, newdata = testing)` call generates **predictions** for the test set using the stacked model.

2. We **evaluate** the trained model performance on three metrics:

- Root Mean Squared Error (RMSE): Measures the average magnitude of the prediction errors.

- Mean Absolute Error (MAE): Measures the average of the absolute errors, giving a sense of how far off the predictions are.

- R-squared (R2): Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.

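RMSE and MAE are computed with functions from the `MLmetrics` package. R2 is computed with `R2()` from `caret`, which by default returns the squared correlation between predicted and observed values.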
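
For reference, the three metrics can be written out in a few lines of base R. The values below are invented so that the results are easy to check by hand. The sketch also shows that the two common definitions of R-squared (squared correlation and one minus the ratio of residual to total sum of squares) can differ on the same data.

```r
# Toy values, chosen so the results are easy to check by hand
y_true <- c(1, 2, 3, 4)
y_pred <- c(1, 2, 3, 6)

errors <- y_pred - y_true

rmse_manual <- sqrt(mean(errors^2))   # sqrt(4 / 4) = 1
mae_manual <- mean(abs(errors))       # 2 / 4 = 0.5

# Two common definitions of R-squared, which can give different values
r2_corr <- cor(y_pred, y_true)^2
r2_traditional <- 1 - sum(errors^2) / sum((y_true - mean(y_true))^2)

cat("RMSE:", rmse_manual, "\n")
cat("MAE:", mae_manual, "\n")
cat("R2 (squared correlation):", r2_corr, "\n")
cat("R2 (1 - SSE/SST):", r2_traditional, "\n")
```

A single large error raises RMSE more than MAE, which is why the two values differ here even though three of the four predictions are exact.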

In [None]:
predictions <- predict(stacked_model, newdata = testing)

In [None]:
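# Namespaced calls make the source of each metric explicit;
# lowercase result names avoid shadowing the metric functions
rmse <- MLmetrics::RMSE(predictions$pred, testing$UnitLengthStay_trunc)
mae <- MLmetrics::MAE(predictions$pred, testing$UnitLengthStay_trunc)
r2 <- caret::R2(predictions$pred, testing$UnitLengthStay_trunc)
cat("RMSE:", rmse, "\n")
cat("MAE:", mae, "\n")
cat("R2:", r2, "\n")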