<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 5px; height: 70px">

# Optimising Hospital Bed Occupancy through Machine Learning
**DSI-41 Group FWSG**: Muhammad Faaiz Khan, Sharifah Nurulhuda, Tan Wei Chiong, Gabriel Tan

### 03_02 Modelling: Random Forests

### Overview

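A random forest trains many decision trees. Each tree is fit on a bootstrap sample of the training rows and considers only a random subset of the features when choosing each split. For regression, the predictions of the individual trees are averaged to produce the final prediction.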
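The averaging step can be checked directly. The sketch below is illustrative only: it uses a small synthetic dataset rather than the hospital data, and the names `X_demo`, `y_demo` and `forest` are assumptions introduced for this example. It fits a `RandomForestRegressor` and compares its prediction with the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only (not the hospital dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.5, size=200)

# Each tree is fit on a bootstrap sample of the rows; max_features sets the
# fraction of features considered at each split
forest = RandomForestRegressor(n_estimators=25, max_features=0.5, random_state=42)
forest.fit(X_demo, y_demo)

# Collect every tree's prediction, then average across trees
tree_preds = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)

# Compare the manual average with the forest's own prediction
print(np.allclose(manual_average, forest.predict(X_demo)))
```

Averaging many decorrelated trees is what reduces the variance of a single deep tree, although the forest can still overfit when the trees are grown very deep on a small dataset.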

We also run Grid Search CV in this notebook to search for the best values for the random forest hyperparameters.

### Import Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler


### Load Data

In [2]:
#read csv
df = pd.read_csv('../datasets/for_regression.csv')

In [3]:
# view all columns in the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 730 entries, 0 to 729
Data columns (total 45 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   year                                      730 non-null    int64  
 1   nuh                                       730 non-null    float64
 2   total_rainfall                            730 non-null    float64
 3   wet_bulb_temperature                      730 non-null    float64
 4   dengue                                    730 non-null    float64
 5   urti                                      730 non-null    float64
 6   diarrhea                                  730 non-null    float64
 7   num__wet_bulb_temperature                 730 non-null    float64
 8   num__total_rainfall                       730 non-null    float64
 9   num__dengue                               730 non-null    float64
 10  num__urti                             

In [4]:
# drop the raw columns; their num__-prefixed counterparts are kept
df.drop(columns=['wet_bulb_temperature','total_rainfall','dengue','urti','diarrhea'],inplace=True)

In [5]:
# preview the first five rows
df.head()

Unnamed: 0,year,nuh,num__wet_bulb_temperature,num__total_rainfall,num__dengue,num__urti,num__diarrhea,num__wet_bulb_temperature^2,num__wet_bulb_temperature total_rainfall,num__wet_bulb_temperature dengue,...,cat__month_May,cat__month_Nov,cat__month_Oct,cat__month_Sep,cat__day_Mon,cat__day_Sat,cat__day_Sun,cat__day_Thu,cat__day_Tue,cat__day_Wed
0,2022,72.0,26.0,25.2,11.0,904.0,287.0,676.0,655.2,286.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,2022,73.7,24.1,50.0,15.0,977.0,297.0,580.81,1205.0,361.5,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,2022,78.7,25.6,6.0,15.0,977.0,297.0,655.36,153.6,384.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,2022,79.1,25.8,4.8,15.0,977.0,297.0,665.64,123.84,387.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,2022,81.8,26.3,1.0,15.0,977.0,297.0,691.69,26.3,394.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


### Grid Search CV

Create a pipeline that grid-searches the best random forest hyperparameters and outputs the scores. The cell below does this for NUH.

In [6]:
#define X
X = df.drop(columns='nuh')

#define y
y = df['nuh']

#split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Instantiate the RandomForestRegressor; its main hyperparameters are tuned by the grid below
rf = RandomForestRegressor(random_state=42)

# create a preprocessing transformer to scale numeric cols
num_feats = X.select_dtypes(include='number').columns
ss = ColumnTransformer(
    transformers=[
        ('numeric', StandardScaler(), num_feats)],
    remainder='passthrough')

pipe = Pipeline([
    ('preprocessor', ss),   # Step 1: Transformation
    ('estimator', rf)       # Step 2: Estimation with RandomForestRegressor
])

# Set up the parameter grid for hyperparameters
param_grid = {
    'estimator__n_estimators': [50, 100, 200],   # Number of trees in the forest
    'estimator__max_depth': [None, 10, 20],       # Maximum depth of the tree
    'estimator__min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
}

# Instantiate GridSearchCV
grid_search = GridSearchCV(pipe, param_grid, cv=5)

# Fit the pipeline with GridSearchCV
grid_search.fit(X_train, y_train)

#predict performance on X_test
y_pred = grid_search.predict(X_test)

# Access the best hyperparameters from the grid search results
best_params = grid_search.best_params_

#evaluate model with RMSE, train and test scores
train_score = grid_search.score(X_train,y_train)
test_score = grid_search.score(X_test,y_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))


# display the best hyperparameters, the R2 scores and the RMSE
print('For NUH')
print("Best hyperparameters:", best_params)

print(f'Random Forest Train R2-Score : {train_score:.4f}')
print(f'Random Forest Test R2-Score : {test_score:.4f}')
print(f'Random Forest RMSE : {rmse:.4f}\n')

For NUH
Best hyperparameters: {'estimator__max_depth': 20, 'estimator__min_samples_split': 10, 'estimator__n_estimators': 200}
Random Forest Train R2-Score : 0.8522
Random Forest Test R2-Score : 0.5437
Random Forest RMSE : 2.6834
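To give a sense of how much work the grid search above does, the following sketch counts the candidate combinations with `ParameterGrid`. The grid is copied from the cell above; the variable names `n_candidates`, `n_folds` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    'estimator__n_estimators': [50, 100, 200],
    'estimator__max_depth': [None, 10, 20],
    'estimator__min_samples_split': [2, 5, 10],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 combinations
n_folds = 5                                    # cv=5 in GridSearchCV
n_fits = n_candidates * n_folds + 1            # +1 for the final refit on the full training set

print(n_candidates, n_fits)
```

Because the cost grows multiplicatively with every added hyperparameter value, extending the grid is worth doing one parameter at a time.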



### Insights

The results from the Random Forest Model above are as follows:

|           **Hospital**           |   NUH  |
|:--------------------------------:|:------:|
|          **Train Score**         | 0.8522 |
|          **Test Score**          | 0.5437 |
|          **RMSE Value**          | 2.6834 |
|     **estimator__max_depth**     |   20   |
| **estimator__min_samples_split** |   10   |
|    **estimator__n_estimators**   |   200  |


**Insight 1:** 
Based on the R2 scores above, the model shows evidence of overfitting.

**Evidence:**
* Large gap between the train (0.8522) and test (0.5437) R2 scores: the model explains about 85% of the variance in the training data but only about 54% on unseen data.
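One way to examine this gap more closely is to ask `GridSearchCV` for training scores as well, using `return_train_score=True`. The sketch below is a hedged illustration on synthetic data, not a rerun of this notebook: the data, the names `X_demo`, `y_demo` and `search`, and the choice to vary `min_samples_leaf` (which is not in the grid above) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic, noisy data for illustration only (not the hospital dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=150)

# return_train_score=True records the R2 on the training folds as well
search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    param_grid={'min_samples_leaf': [1, 5, 20]},
    cv=5,
    return_train_score=True,
)
search.fit(X_demo, y_demo)

# A large train-minus-validation gap for a candidate suggests overfitting
results = pd.DataFrame(search.cv_results_)
results['gap'] = results['mean_train_score'] - results['mean_test_score']
print(results[['param_min_samples_leaf', 'mean_train_score', 'mean_test_score', 'gap']])
```

Note that `mean_test_score` in `cv_results_` refers to the validation folds inside the training set, not to the held-out `X_test`.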

**Insight 2:**
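The test RMSE of 2.6834 is decent. RMSE is expressed in the same units as the target `nuh`, so predictions are typically off by around 2-3 percentage points of bed occupancy.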
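To make the two metrics concrete, the sketch below computes RMSE and R2 by hand with NumPy. The `y_true` values are the first five `nuh` values shown in `df.head()`; the predictions in `y_hat` are hypothetical numbers chosen for illustration and are not model output.

```python
import numpy as np

# First five nuh values from df.head(); y_hat is hypothetical, for illustration only
y_true = np.array([72.0, 73.7, 78.7, 79.1, 81.8])
y_hat = np.array([74.0, 75.0, 77.0, 80.0, 79.5])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual, in the same units as the target
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: one minus the ratio of residual variation to total variation around the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f'RMSE: {rmse:.4f}, R2: {r2:.4f}')
```

RMSE reports the size of the errors in absolute terms, while R2 reports them relative to the spread of the target, which is why a model can have a modest R2 and still a small RMSE when the target varies within a narrow range.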