<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 5px; height: 70px">

# Optimising Hospital Bed Occupancy through Machine Learning
**DSI-41 Group FWSG**: Muhammad Faaiz Khan, Sharifah Nurulhuda, Tan Wei Chiong, Gabriel Tan

### 03_01 Linear Regression with RidgeCV and LassoCV

### Overview

In this notebook, we fit the following regression models on the dataset whose features were engineered with polynomial features:

* Linear Regression
* Ridge CV
* Lasso CV

The purpose is to explore the linear relationships between the features and bed occupancy, while using regularisation to tackle overfitting and improve predictive performance.

|                        Linear Regression                        |                          Ridge CV                          |                            Lasso CV                            |
|:---------------------------------------------------------------:|:----------------------------------------------------------:|:--------------------------------------------------------------:|
| Models the linear relationship between the features and a continuous target | Regularisation technique that penalises the sum of squared coefficients (L2) | Regularisation technique that penalises the sum of absolute values of coefficients (L1) |
| Minimises the squared discrepancies between predicted and actual values | Shrinks coefficients towards zero to correct for overfitting | Shrinks coefficients, sometimes exactly to zero |
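For reference, the three objectives in the table can be written side by side. The notation here is our own and is not defined elsewhere in the notebook: $y_i$ is the occupancy on day $i$, $\mathbf{x}_i$ its feature vector, $\beta_0$ the intercept, $\beta$ the coefficient vector and $\alpha \ge 0$ the regularisation strength (the values passed as `alphas`). The $\frac{1}{2n}$ factor in the Lasso line follows scikit-learn's convention.

$$
\begin{aligned}
\text{Linear Regression:}\quad & \min_{\beta_0,\,\beta}\ \sum_{i=1}^{n}\left(y_i - \beta_0 - \mathbf{x}_i^{\top}\beta\right)^2 \\
\text{Ridge:}\quad & \min_{\beta_0,\,\beta}\ \sum_{i=1}^{n}\left(y_i - \beta_0 - \mathbf{x}_i^{\top}\beta\right)^2 + \alpha\sum_{j=1}^{p}\beta_j^{2} \\
\text{Lasso:}\quad & \min_{\beta_0,\,\beta}\ \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - \mathbf{x}_i^{\top}\beta\right)^2 + \alpha\sum_{j=1}^{p}\left|\beta_j\right|
\end{aligned}
$$

With $\alpha = 0$ both penalised objectives reduce to ordinary least squares; larger $\alpha$ shrinks the coefficients more strongly.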

### Import Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler


### Load Data

In [2]:
# read csv
df = pd.read_csv('../datasets/for_regression.csv')

In [3]:
# view all columns in the dataset
df.columns

Index(['year', 'nuh', 'total_rainfall', 'wet_bulb_temperature', 'dengue',
       'urti', 'diarrhea', 'num__wet_bulb_temperature', 'num__total_rainfall',
       'num__dengue', 'num__urti', 'num__diarrhea',
       'num__wet_bulb_temperature^2',
       'num__wet_bulb_temperature total_rainfall',
       'num__wet_bulb_temperature dengue', 'num__wet_bulb_temperature urti',
       'num__wet_bulb_temperature diarrhea', 'num__total_rainfall^2',
       'num__total_rainfall dengue', 'num__total_rainfall urti',
       'num__total_rainfall diarrhea', 'num__dengue^2', 'num__dengue urti',
       'num__dengue diarrhea', 'num__urti^2', 'num__urti diarrhea',
       'num__diarrhea^2', 'cat__is_holiday_1', 'cat__month_Aug',
       'cat__month_Dec', 'cat__month_Feb', 'cat__month_Jan', 'cat__month_Jul',
       'cat__month_Jun', 'cat__month_Mar', 'cat__month_May', 'cat__month_Nov',
       'cat__month_Oct', 'cat__month_Sep', 'cat__day_Mon', 'cat__day_Sat',
       'cat__day_Sun', 'cat__day_Thu', 'cat__day_Tue

In [4]:
# view all columns, nulls and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 730 entries, 0 to 729
Data columns (total 45 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   year                                      730 non-null    int64  
 1   nuh                                       730 non-null    float64
 2   total_rainfall                            730 non-null    float64
 3   wet_bulb_temperature                      730 non-null    float64
 4   dengue                                    730 non-null    float64
 5   urti                                      730 non-null    float64
 6   diarrhea                                  730 non-null    float64
 7   num__wet_bulb_temperature                 730 non-null    float64
 8   num__total_rainfall                       730 non-null    float64
 9   num__dengue                               730 non-null    float64
 10  num__urti                             

In [5]:
# drop the raw numeric columns; their engineered `num__` counterparts are kept
df.drop(columns=['wet_bulb_temperature','total_rainfall','dengue','urti','diarrhea'],inplace=True)

In [6]:
#viewing final columns
df.columns

Index(['year', 'nuh', 'num__wet_bulb_temperature', 'num__total_rainfall',
       'num__dengue', 'num__urti', 'num__diarrhea',
       'num__wet_bulb_temperature^2',
       'num__wet_bulb_temperature total_rainfall',
       'num__wet_bulb_temperature dengue', 'num__wet_bulb_temperature urti',
       'num__wet_bulb_temperature diarrhea', 'num__total_rainfall^2',
       'num__total_rainfall dengue', 'num__total_rainfall urti',
       'num__total_rainfall diarrhea', 'num__dengue^2', 'num__dengue urti',
       'num__dengue diarrhea', 'num__urti^2', 'num__urti diarrhea',
       'num__diarrhea^2', 'cat__is_holiday_1', 'cat__month_Aug',
       'cat__month_Dec', 'cat__month_Feb', 'cat__month_Jan', 'cat__month_Jul',
       'cat__month_Jun', 'cat__month_Mar', 'cat__month_May', 'cat__month_Nov',
       'cat__month_Oct', 'cat__month_Sep', 'cat__day_Mon', 'cat__day_Sat',
       'cat__day_Sun', 'cat__day_Thu', 'cat__day_Tue', 'cat__day_Wed'],
      dtype='object')

In [7]:
#view first 5 rows in the dataset
df.head()

Unnamed: 0,year,nuh,num__wet_bulb_temperature,num__total_rainfall,num__dengue,num__urti,num__diarrhea,num__wet_bulb_temperature^2,num__wet_bulb_temperature total_rainfall,num__wet_bulb_temperature dengue,...,cat__month_May,cat__month_Nov,cat__month_Oct,cat__month_Sep,cat__day_Mon,cat__day_Sat,cat__day_Sun,cat__day_Thu,cat__day_Tue,cat__day_Wed
0,2022,72.0,26.0,25.2,11.0,904.0,287.0,676.0,655.2,286.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,2022,73.7,24.1,50.0,15.0,977.0,297.0,580.81,1205.0,361.5,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,2022,78.7,25.6,6.0,15.0,977.0,297.0,655.36,153.6,384.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,2022,79.1,25.8,4.8,15.0,977.0,297.0,665.64,123.84,387.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,2022,81.8,26.3,1.0,15.0,977.0,297.0,691.69,26.3,394.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


### Linear Regression

In [8]:
# define the feature matrix X and the target y (NUH occupancy)
X = df.drop(columns='nuh')
y = df['nuh']

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# preprocessing transformer: standardise every numeric column
# (this selection also covers `year` and the one-hot `cat__` columns,
# hence the `numeric__` prefix on every feature name in the output)
ss = ColumnTransformer(
    transformers=[
        ('numeric', StandardScaler(), X.select_dtypes(include='number').columns)],
    remainder='passthrough')

# instantiate a pipeline
pipe = Pipeline([
    ('preprocessor', ss),              # step 1: transformation
    ('estimator', LinearRegression())  # step 2: estimation
])

# fit the pipe
pipe.fit(X_train, y_train)

# predict on the test and train sets
y_pred_test = pipe.predict(X_test)
y_pred_train = pipe.predict(X_train)

# evaluate the model with test RMSE and train/test R2
rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)

# rank coefficients by absolute value to see which features influence the linear regression model most
feature_names = pipe[:-1].get_feature_names_out()
coefficients = pipe['estimator'].coef_
sorted_coefficients = pd.DataFrame(sorted(zip(feature_names, coefficients), key=lambda x: abs(x[1]), reverse=True))

# display results for model RMSE, R2 score and top coefficients
print('For NUH')
print(f'Linear Regression RMSE : {rmse:.4f}')
print(f'Linear Regression Train R2 score: {train_r2:.4f}')
print(f'Linear Regression Test R2 score: {test_r2:.4f}')
print(f'Linear Regression Top Coefficients: \n{sorted_coefficients}\n\n\n')

For NUH
Linear Regression RMSE : 2.9420
Linear Regression Train R2 score: 0.6852
Linear Regression Test R2 score: 0.4515
Linear Regression Top Coefficients: 
                                                    0          1
0                                numeric__num__dengue -10.127304
1                              numeric__num__diarrhea  10.090801
2   numeric__num__wet_bulb_temperature total_rainfall   9.164985
3                        numeric__num__total_rainfall  -8.370088
4           numeric__num__wet_bulb_temperature dengue   7.992937
5                            numeric__num__diarrhea^2  -7.524196
6                       numeric__num__dengue diarrhea   6.881108
7                  numeric__num__wet_bulb_temperature  -6.725581
8                numeric__num__wet_bulb_temperature^2   6.552475
9         numeric__num__wet_bulb_temperature diarrhea  -3.787472
10                             numeric__num__dengue^2  -3.643303
11            numeric__num__wet_bulb_temperature urti   3.1480
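The large coefficients with opposite signs on closely related terms are what one would expect when features are strongly correlated, as a polynomial term usually is with its base feature. The sketch below illustrates this on synthetic data only (the temperature range and target are made up): it measures the correlation between a variable and its square, then compares the size of the ordinary least squares coefficients with those of a ridge fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# synthetic "temperature" in a narrow positive range, plus its square
temp = rng.uniform(24.0, 27.0, size=200)
X = np.column_stack([temp, temp ** 2])

# synthetic target that depends linearly on temperature, with noise
y = 2.0 * temp + rng.normal(scale=1.0, size=200)

# correlation between the base feature and its polynomial term
corr = np.corrcoef(X[:, 0], X[:, 1])[0, 1]

# fit both models on standardised features
X_scaled = StandardScaler().fit_transform(X)
ols = LinearRegression().fit(X_scaled, y)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)

print(f'correlation between temp and temp^2: {corr:.4f}')
print(f'OLS coefficients  : {ols.coef_}')
print(f'Ridge coefficients: {ridge.coef_}')
```

Because the two columns carry almost the same information, least squares is free to split the effect between them in unstable ways, while the L2 penalty keeps the coefficient vector small.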

### Ridge CV

In [9]:
# define the feature matrix X and the target y (NUH occupancy)
X = df.drop(columns='nuh')
y = df['nuh']

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# preprocessing transformer: standardise every numeric column
# (this selection also covers `year` and the one-hot `cat__` columns)
ss = ColumnTransformer(
    transformers=[
        ('numeric', StandardScaler(), X.select_dtypes(include='number').columns)],
    remainder='passthrough')

# instantiate a pipeline
pipe = Pipeline([
    ('preprocessor', ss),                                    # step 1: transformation
    ('estimator', RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5))    # step 2: estimation, alpha chosen by 5-fold CV
])

# fit the pipe
pipe.fit(X_train, y_train)

# predict on the test and train sets
y_pred_test = pipe.predict(X_test)
y_pred_train = pipe.predict(X_train)

# evaluate the model with test RMSE and train/test R2
rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)

# rank coefficients by absolute value to see which features influence the Ridge CV model most
feature_names = pipe[:-1].get_feature_names_out()
coefficients = pipe['estimator'].coef_
sorted_coefficients = pd.DataFrame(sorted(zip(feature_names, coefficients), key=lambda x: abs(x[1]), reverse=True))

# display results for model RMSE, R2 score and top coefficients
print('For NUH')
print(f'Ridge CV RMSE : {rmse:.4f}')
print(f'Ridge CV Train R2 score: {train_r2:.4f}')
print(f'Ridge CV Test R2 score: {test_r2:.4f}')
print(f'Ridge CV Top Coefficients: \n{sorted_coefficients}\n\n\n')

For NUH
Ridge CV RMSE : 2.9901
Ridge CV Train R2 score: 0.6680
Ridge CV Test R2 score: 0.4335
Ridge CV Top Coefficients: 
                                                    0         1
0                             numeric__cat__month_Jan -1.831898
1                              numeric__num__dengue^2 -1.466491
2                          numeric__cat__is_holiday_1 -1.382745
3                             numeric__cat__month_Dec -1.111139
4                               numeric__cat__day_Sat -1.070305
5                             numeric__cat__month_Nov -1.001085
6                       numeric__num__dengue diarrhea  0.763119
7                               numeric__cat__day_Tue  0.731930
8                               numeric__cat__day_Thu  0.661096
9               numeric__num__total_rainfall diarrhea -0.649825
10                              numeric__cat__day_Wed  0.628403
11                           numeric__num__diarrhea^2 -0.600764
12                              numeric__cat__
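The cell above does not print which of the three candidate alphas the cross-validation selected. A minimal sketch of how one might check this, using synthetic data from `make_regression` and a wider, log-spaced grid (both are assumptions for illustration, not part of the notebook):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# synthetic regression problem standing in for the real features
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=42)

# log-spaced candidate penalties from 0.001 to 1000
alphas = np.logspace(-3, 3, 13)

ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)

# alpha_ holds the penalty chosen by cross-validation
print(f'selected alpha: {ridge.alpha_}')

# a selection at either end of the grid suggests the grid should be extended
at_edge = bool(np.isclose(ridge.alpha_, alphas[0]) or np.isclose(ridge.alpha_, alphas[-1]))
print(f'selected alpha is at the edge of the grid: {at_edge}')
```

In the notebook's pipeline the same attribute would be reachable as `pipe['estimator'].alpha_` after fitting.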

### Lasso CV

In [10]:
# define the feature matrix X and the target y (NUH occupancy)
X = df.drop(columns='nuh')
y = df['nuh']

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# preprocessing transformer: standardise every numeric column
# (this selection also covers `year` and the one-hot `cat__` columns)
ss = ColumnTransformer(
    transformers=[
        ('numeric', StandardScaler(), X.select_dtypes(include='number').columns)],
    remainder='passthrough')

# instantiate a pipeline
pipe = Pipeline([
    ('preprocessor', ss),                                    # step 1: transformation
    ('estimator', LassoCV(alphas=[0.1, 1.0, 10.0], cv=5))    # step 2: estimation, alpha chosen by 5-fold CV
])

# fit the pipe
pipe.fit(X_train, y_train)

# predict on the test and train sets
y_pred_test = pipe.predict(X_test)
y_pred_train = pipe.predict(X_train)

# evaluate the model with test RMSE and train/test R2
rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)

# rank coefficients by absolute value to see which features influence the Lasso CV model most
feature_names = pipe[:-1].get_feature_names_out()
coefficients = pipe['estimator'].coef_
sorted_coefficients = pd.DataFrame(sorted(zip(feature_names, coefficients), key=lambda x: abs(x[1]), reverse=True))

# display results for model RMSE, R2 score and top coefficients
print('For NUH')
print(f'Lasso CV RMSE : {rmse:.4f}')
print(f'Lasso CV Train R2 score: {train_r2:.4f}')
print(f'Lasso CV Test R2 score: {test_r2:.4f}')
print(f'Lasso CV Top Coefficients: \n{sorted_coefficients}\n\n\n')

For NUH
Lasso CV RMSE : 2.9939
Lasso CV Train R2 score: 0.6411
Lasso CV Test R2 score: 0.4320
Lasso CV Top Coefficients: 
                                                    0         1
0                             numeric__cat__month_Jan -1.677425
1                          numeric__cat__is_holiday_1 -1.324496
2                               numeric__cat__day_Sat -1.175736
3                             numeric__cat__month_Dec -0.768231
4                             numeric__cat__month_Nov -0.766785
5                               numeric__cat__day_Sun -0.586446
6                                numeric__num__urti^2  0.573837
7                               numeric__cat__day_Tue  0.522907
8                               numeric__cat__day_Thu  0.440310
9                               numeric__cat__day_Wed  0.403396
10                            numeric__cat__month_May  0.319651
11                              numeric__cat__day_Mon  0.306679
12                            numeric__cat__mo
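Since the L1 penalty can set coefficients exactly to zero, it is useful to count how many features Lasso actually keeps. The printed output above is truncated, so the sketch below shows the idea on synthetic data from `make_regression` (an assumption for illustration), with the same alpha grid as the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# synthetic problem: 20 features, of which only 5 carry signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=42)

lasso = LassoCV(alphas=[0.1, 1.0, 10.0], cv=5).fit(X, y)

# count coefficients that the L1 penalty has set exactly to zero
n_zero = int(np.sum(lasso.coef_ == 0))
n_kept = lasso.coef_.size - n_zero

print(f'selected alpha: {lasso.alpha_}')
print(f'features kept: {n_kept}, features dropped: {n_zero}')
```

Applied to the notebook's pipeline, the same count could be taken from `pipe['estimator'].coef_`.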

### Insights from the Models above

The following table summarises the metrics printed by the three models above. RMSE is computed on the test set.

**Metrics :**

|     Metrics    | Linear Regression | Ridge CV | Lasso CV |
|:--------------:|:-----------------:|:--------:|:--------:|
| Train R2 Score |       0.6852      |  0.6680  |  0.6411  |
|  Test R2 Score |       0.4515      |  0.4335  |  0.4320  |
|      RMSE      |       2.9420      |  2.9901  |  2.9939  |
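A summary table like this can be built in code rather than copied by hand, which keeps it in step with the cell outputs. The following is a minimal sketch on synthetic data from `make_regression`; the simplified pipeline and the dictionary of models are assumptions for illustration, not the notebook's exact setup.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data standing in for the engineered features
X, y = make_regression(n_samples=300, n_features=15, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'Ridge CV': RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5),
    'Lasso CV': LassoCV(alphas=[0.1, 1.0, 10.0], cv=5),
}

metric_names = ['Train R2 Score', 'Test R2 Score', 'RMSE']
rows = {}
for name, estimator in models.items():
    # same two-step structure as the notebook: scale, then estimate
    pipe = Pipeline([('scaler', StandardScaler()), ('estimator', estimator)])
    pipe.fit(X_train, y_train)
    y_pred_train = pipe.predict(X_train)
    y_pred_test = pipe.predict(X_test)
    rows[name] = {
        'Train R2 Score': r2_score(y_train, y_pred_train),
        'Test R2 Score': r2_score(y_test, y_pred_test),
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred_test)),
    }

# one column per model, one row per metric
summary = pd.DataFrame(rows, index=metric_names).round(4)
print(summary)
```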


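**Insights :**

The two regularised models agree: time-related features (month, public holiday and day of week) have the largest coefficients, so they affect bed occupancy the most. The unregularised Linear Regression instead puts its largest coefficients on the disease and weather terms, with opposite signs on closely related features (for example `num__dengue` at -10.13 and `num__wet_bulb_temperature dengue` at 7.99), which is typical of correlated polynomial features.

All three models fit the training set noticeably better than the test set (train R2 between 0.64 and 0.69, test R2 below 0.46), so some overfitting remains even after regularisation.

After regularisation, and relative to the omitted baseline categories (April and Friday are absent from the one-hot columns),
* May is the month with the largest positive impact on occupancy
* January is the month with the largest negative impact on occupancy
* Public holidays and weekends tend to have lower occupancy rates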