## Introduction
This notebook provides an interactive environment for training a predictive model to estimate patient length of stay (LOS). This workflow is particularly suited for healthcare professionals and researchers interested in ICU resource planning and patient flow management.

You only need to edit the code cells that begin with the following marker:

In [1]:
### ‼️User Action Required

### Package Installation
This cell imports the Python packages the notebook depends on. Install any that are missing from your environment before running it.
These packages are essential for data loading, model training, ensemble learning, and performance evaluation:
- `pandas` and `numpy`: Data manipulation and numerical operations.
- `pyreadr`: Reads R data files (such as `.RData`) into pandas DataFrames.
- `scikit-learn` (`sklearn`): Provides the linear regression and Random Forest models, the stacking ensemble, train/test splitting, cross-validated grid search, and evaluation metrics (e.g., MAE, RMSE, R²).
- `joblib`: Utility for saving and loading Python objects such as fitted models.

⚠️ You only need to run this block once per session or when setting up a new environment.
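
A minimal sketch of a presence check in Python, assuming the import names used in the next cell (`sklearn` is the import name of scikit-learn). The helper `find_missing_packages` is a hypothetical name and not part of the notebook; it only reports what is missing and leaves installation to you.

```python
import importlib.util

# Import names assumed from the notebook's import cell.
REQUIRED = ["pandas", "pyreadr", "numpy", "sklearn", "joblib"]


def find_missing_packages(names):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing_packages(REQUIRED)
if missing:
    print(f"Missing packages: {missing}. Install them (e.g. with pip) before continuing.")
else:
    print("All required packages are available.")
```

Note that the pip distribution name can differ from the import name, for example `scikit-learn` for `sklearn`.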


In [2]:
import pandas as pd
import pyreadr
import os
import sys
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.metrics import mean_squared_error, make_scorer, mean_absolute_error, r2_score
from sklearn.ensemble import StackingRegressor
import joblib

### Load and Validate User Dataset

1. Set `data_path` to the **path to the data** you want to train the model on. The file can be in `.csv` or `.RData` format. For an `.RData` file, also set `object_name` to the name of the object to load.

2. To use your **own predictors**, point `predictors_path` to a CSV file that lists the names of your predictors. The names are read from the second column of that file.

Otherwise, the list of predictor variables used during the original model training is loaded automatically from `predictors.csv`.

⚠️ If you are not using your own predictors, make sure your dataset includes all required predictors listed in `predictors.csv`, as well as the target variable `UnitLengthStay_trunc`. Columns with more than 30% missing values are dropped when the data is loaded, so a required predictor with too many missing values will be reported as missing.
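
The predictor names are read from the second column of the predictors file (`iloc[:, 1]` in `select_predictors`). A sketch of writing a compatible file, assuming hypothetical predictor names (`age`, `heart_rate`, `admission_type`) and a hypothetical file name `my_predictors.csv`; a temporary directory is used here only to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Hypothetical predictor names; replace them with columns that exist in your dataset.
my_predictors = pd.DataFrame({"predictor": ["age", "heart_rate", "admission_type"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_predictors.csv")

    # index=True writes the row index as the first column, so the names
    # end up in the second column, where the notebook looks for them.
    my_predictors.to_csv(path, index=True)

    # Read the names back the same way select_predictors does.
    names = pd.read_csv(path).iloc[:, 1].tolist()

print(names)
```

In the notebook itself you would write the file next to your data and set `predictors_path` to its location.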

In [3]:
### ‼️User Action Required

data_path = 'Synthetic_Data.csv'  # or 'your_data.RData'
predictors_path = 'predictors.csv'
object_name = 'your_object_name_in_RData'  # used only if .RData

In [1]:
def load_user_data(data_path, object_name):
    """Load a .csv or .RData dataset and drop columns with many missing values."""
    if data_path.lower().endswith(".csv"):
        user_data = pd.read_csv(data_path)
    elif data_path.lower().endswith(".rdata"):
        # pyreadr returns a dict-like mapping of object names to DataFrames
        result = pyreadr.read_r(data_path)
        if object_name not in result:
            raise ValueError(f"The .RData file does not contain an object named '{object_name}'")
        user_data = result[object_name]
    else:
        raise ValueError("Unsupported file type. Please provide a .csv or .RData file.")

    # Keep only columns with at most 30% missing values
    missing_fraction = user_data.isna().mean()
    user_data = user_data.loc[:, missing_fraction <= 0.3]

    return user_data

In [5]:
def select_predictors(user_data, predictors_path):
    predictors_df = pd.read_csv(predictors_path)
    predictors = predictors_df.iloc[:, 1].tolist()  # Second column (like predictors[,2])

    missing = [col for col in predictors if col not in user_data.columns]
    if missing:
        raise ValueError(f"Some required predictors are missing from your dataset: {missing}")

    selected = user_data[predictors + ['UnitLengthStay_trunc']]
    return selected

### Model Training
This section covers the full training pipeline, from splitting the data to building an ensemble model using sklearn.ensemble.

**Steps**:
1. **Train-Test Split**

- A reproducible 80/20 split is created using train_test_split().

2. **Model Training**
   
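- Two base models are trained: linear regression (`sklearn.linear_model`) and Random Forest (`sklearn.ensemble`). The Random Forest is tuned with a 5-fold cross-validated grid search over `max_features`. The fitted models are stored in the variables `lm` and `best_rf`.

3. **Model Stacking**

- A stacked ensemble model is built from the base learners, with a Random Forest as the final estimator, tuned by grid search. The final stacked model is stored in the variable `best_stacked_model`.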
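
To illustrate what stacking does, the sketch below builds the meta-features by hand: each base learner produces out-of-fold predictions, and those predictions become the inputs of the final estimator. This is a simplified illustration on synthetic data (`make_regression`) with small forests, both of which are assumptions made for the example, not the notebook's exact pipeline; `StackingRegressor` performs these steps internally.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the training data (assumption, for illustration only).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

base_learners = {
    "lm": LinearRegression(),
    "rf": RandomForestRegressor(n_estimators=50, min_samples_leaf=5, random_state=42),
}

# Out-of-fold predictions: every row is predicted by a model that never saw it.
meta_features = np.column_stack(
    [cross_val_predict(model, X, y, cv=5) for model in base_learners.values()]
)

# The final estimator learns how to combine the base learners' predictions.
final_estimator = RandomForestRegressor(n_estimators=50, min_samples_leaf=5, random_state=42)
final_estimator.fit(meta_features, y)

print(meta_features.shape)  # one column per base learner
```

Using out-of-fold predictions keeps the final estimator from simply learning to trust whichever base learner overfits the training data most.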

In [6]:
def split_data(dataframe, test_size=0.2, random_state=42):
    X = dataframe.drop(columns='UnitLengthStay_trunc')
    y = dataframe['UnitLengthStay_trunc']
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

In [7]:
rf_params = {
    'max_features': [5, 6, 7, 8, 9, 10],
    'min_samples_leaf': [5], 
    'criterion': ['squared_error'],  
    'random_state': [42]
}
rf_base = RandomForestRegressor()

user_data = load_user_data(data_path, object_name)
filtered_data = select_predictors(user_data, predictors_path)
X_train, X_test, y_train, y_test = split_data(filtered_data)

X_train_encoded = pd.get_dummies(X_train, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)

X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
rf_grid = GridSearchCV(rf_base, rf_params, cv=cv, scoring='neg_root_mean_squared_error', verbose=1)
rf_grid.fit(X_train_encoded, y_train)

best_rf = rf_grid.best_estimator_

Fitting 5 folds for each of 6 candidates, totalling 30 fits


In [8]:
lm = LinearRegression()
lm.fit(X_train_encoded, y_train)

In [9]:
param_grid = {
    'final_estimator__max_features': [2, 5, 10],
    'final_estimator__min_samples_leaf': [5, 10, 15],
    'final_estimator__criterion': ['squared_error', 'absolute_error']
}

stack = StackingRegressor(
    estimators=[
        ('lm', LinearRegression()),
        ('rf_base', RandomForestRegressor(max_features=9, min_samples_leaf=5, random_state=42))
    ],
    final_estimator=RandomForestRegressor(),
    passthrough=False,
    verbose=1
)

grid = GridSearchCV(
    estimator=stack,
    param_grid=param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error'
)
grid.fit(X_train_encoded, y_train)

best_stacked_model = grid.best_estimator_

### Model Prediction and Evaluation
This section handles making predictions with the trained stacked model and evaluating its performance using key metrics.

1. The `predict` call generates **predictions** for the test set using the stacked model.

2. We **evaluate** the trained model performance on three metrics:

- Root Mean Squared Error (RMSE): Measures the average magnitude of the prediction errors.

- Mean Absolute Error (MAE): Measures the average of the absolute errors, giving a sense of how far off the predictions are.

- R-squared (R2): Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.

These metrics are computed with functions from `sklearn.metrics` (`mean_squared_error`, `mean_absolute_error`, and `r2_score`).
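
For reference, the three metrics can also be computed directly from their definitions. The sketch below uses small made-up arrays (not results from this notebook) and NumPy only.

```python
import numpy as np

# Made-up values for illustration only.
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_hat = np.array([3.0, 4.0, 5.0, 10.0])

errors = y_true - y_hat  # [-1, 0, 1, -2]

# RMSE: square root of the mean squared error.
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: mean of the absolute errors.
mae = np.mean(np.abs(errors))

# R²: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.4f}")  # RMSE: 1.2247
print(f"MAE: {mae:.4f}")    # MAE: 1.0000
print(f"R²: {r2:.4f}")      # R²: 0.7000
```

Because RMSE squares the errors before averaging, the single error of 2 weighs more heavily in it than in MAE.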

In [None]:
y_pred = best_stacked_model.predict(X_test_encoded)

In [17]:
rmse = (mean_squared_error(y_test, y_pred))**(1/2)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R²: {r2:.4f}")

RMSE: 3.7578
MAE: 2.6584
R²: 0.1662
