# Grid Search Experiments

This notebook contains the grid search experiments, which serve as the baseline (no dedicated tuning tool), on both the classification and regression datasets.

In [9]:
# Import required modules
import time
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, mean_absolute_error, mean_squared_error, mean_absolute_percentage_error

In [10]:
# Set random seed
RANDOM_SEED = 3

In [11]:
# Function for printing the elapsed time between two time.time() readings
def print_elapsed_time(start, end):
    elapsed_time = end - start
    minutes = int(elapsed_time // 60)
    seconds = int(elapsed_time % 60)
    print("Elapsed time: {} minutes, {} seconds".format(minutes, seconds))

## Hospital Readmissions (Classification)

In this section, we run grid search on our classification dataset. We first define the grid of hyperparameter values to try (`param_grid` below). sklearn then evaluates every combination of those values with cross-validation and keeps the combination that scores best.
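To make the cost of "every combination" concrete, the sketch below enumerates the grid used in this notebook with the standard library only. The names `combinations`, `n_candidates`, `cv_folds`, and `n_fits` are illustrative, and the fold count is an assumption: it is the `GridSearchCV` default (`cv=5`) in recent sklearn versions, which applies here because the notebook does not pass `cv`.

```python
from itertools import product

# Same grid as the notebook's param_grid
param_grid = {
    'max_depth': [5, 7, 10, 15],
    'max_features': [3, 5, 8, 10],
    'n_estimators': [50, 150, 200, 300],
}

# Cartesian product of all value lists, one dict per candidate setting
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(combinations)   # 4 * 4 * 4 candidates
cv_folds = 5                       # assumption: GridSearchCV default when cv is not set
n_fits = n_candidates * cv_folds   # model fits during the search, before the final refit

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_fits)
```

Adding a fourth value list of the same length would multiply both numbers by four, which is why grid search becomes expensive quickly.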

In [12]:
# Read in data
readmissions = pd.read_csv("../data/classification/readmissions_clean.csv")

# Split into X and Y
X = readmissions.drop('readmitted', axis=1)
y = readmissions["readmitted"]

# Split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = RANDOM_SEED)


In [13]:
# Base Random Forest model to tune
rf = RandomForestClassifier(random_state=RANDOM_SEED)

# Choose a parameter grid for grid search
param_grid = {
    'max_depth': [5, 7, 10, 15],
    'max_features': [3, 5, 8, 10],
    'n_estimators': [50, 150, 200, 300]
}

# Time grid search
start = time.time()

# No cv or scoring passed, so sklearn's defaults apply (5-fold CV and accuracy for a classifier)
grid_search_classifier = GridSearchCV(estimator = rf, param_grid = param_grid)
grid_search_classifier.fit(X_train, y_train)

end = time.time()


We see that the grid search optimization took 9 minutes and 24 seconds:

In [14]:
# Display time elapsed
print_elapsed_time(start,end)

Elapsed time: 9 minutes, 24 seconds


We can view the optimal set of parameters that was found by the grid search:

In [15]:
# Get optimal parameters
grid_search_classifier.best_params_

{'max_depth': 5, 'max_features': 3, 'n_estimators': 200}

Grid search selected the following parameters, so we refit the model with them on the training set and compute the final metrics on the held-out test set:

`{'max_depth': 5, 'max_features': 3, 'n_estimators': 200}`
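As a side note, the best parameters do not have to be typed in by hand. The sketch below, run on synthetic data rather than the readmissions dataset, shows two equivalent routes: unpacking `best_params_` into a new model, or reusing `best_estimator_`, which `GridSearchCV` refits on the full training data when `refit=True` (its default). All names here (`X_demo`, `small_grid`, `search`, and so on) are hypothetical and chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Small synthetic problem so the example runs in a few seconds
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Deliberately tiny grid (illustrative values, not the notebook's grid)
small_grid = {"max_depth": [2, 4], "n_estimators": [10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid)
search.fit(X_tr, y_tr)

# Route 1: rebuild the model by unpacking the best parameters
manual_rf = RandomForestClassifier(random_state=0, **search.best_params_)
manual_rf.fit(X_tr, y_tr)

# Route 2: reuse the estimator that GridSearchCV already refit on X_tr, y_tr
refit_rf = search.best_estimator_

# With the same seed and training data, both routes give the same predictions
same_predictions = bool((manual_rf.predict(X_te) == refit_rf.predict(X_te)).all())
print("Best parameters:", search.best_params_)
print("Identical predictions:", same_predictions)
```

Either route avoids copy errors when the grid or the data changes and the search is rerun.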


In [16]:
# Fit model with optimal parameters
rf = RandomForestClassifier(random_state=RANDOM_SEED, max_depth = 5, max_features = 3, n_estimators = 200)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Get performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.6278
Precision: 0.6252028123309897
Recall: 0.4974182444061962
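For reading these three numbers, the sketch below recomputes accuracy, precision, and recall by hand on a toy example. It assumes a binary target where `1` is the positive class, which matches the defaults of `precision_score` and `recall_score`. The helper `classification_metrics` and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of predicted positives that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of actual positives that are found
    return accuracy, precision, recall


# Toy labels (made up): the model finds only half of the positives
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 1]

toy_accuracy, toy_precision, toy_recall = classification_metrics(toy_true, toy_pred)
print("Accuracy:", toy_accuracy)
print("Precision:", toy_precision)
print("Recall:", toy_recall)
```

Read this way, a recall near 0.5 as in the output above means the tuned model identifies roughly half of the actual positive cases in the test set.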


## Car Emissions (Regression)

For regression, we follow the same steps, using `RandomForestRegressor` in place of `RandomForestClassifier`.

In [17]:
# Read in data
emissions = pd.read_csv("../data/regression/emissions_cleaned.csv")

# Split into X and Y
X = emissions.drop('co2_emissions', axis=1)
y = emissions["co2_emissions"]

# Split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = RANDOM_SEED)


In [18]:
# Base Random Forest model to tune
rf = RandomForestRegressor(random_state=RANDOM_SEED)

# Choose a parameter grid for grid search
param_grid = {
    'max_depth': [5, 7, 10, 15],
    'max_features': [3, 5, 8, 10],
    'n_estimators': [50, 150, 200, 300]
}

# Time grid search
start = time.time()

# No cv or scoring passed, so sklearn's defaults apply (5-fold CV and R^2 for a regressor)
grid_search_regressor = GridSearchCV(estimator = rf, param_grid = param_grid)
grid_search_regressor.fit(X_train, y_train)

end = time.time()

We can see that the grid search optimization took 2 minutes and 47 seconds:

In [19]:
# Display time elapsed
print_elapsed_time(start,end)

Elapsed time: 2 minutes, 47 seconds


Below are the best parameters that grid search found among all combinations in the grid:

In [20]:
# Get optimal parameters
grid_search_regressor.best_params_


{'max_depth': 15, 'max_features': 10, 'n_estimators': 300}

Grid search selected the following parameters, so we plug them back into the model to get our final metrics on the held-out test set:

`{'max_depth': 15, 'max_features': 10, 'n_estimators': 300}`

In [28]:
# Fit model with optimal parameters
rf = RandomForestRegressor(random_state=RANDOM_SEED, max_depth = 15, max_features = 10, n_estimators = 300)

rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print('Mean Absolute Error (MAE):', mean_absolute_error(y_test, y_pred))
print('Mean Absolute Percentage Error (MAPE):', mean_absolute_percentage_error(y_test, y_pred))
print('Mean Squared Error (MSE):', mean_squared_error(y_test, y_pred))

Mean Absolute Error (MAE): 1.6924411810082374
Mean Absolute Percentage Error (MAPE): 0.007012699729565289
Mean Squared Error (MSE): 10.143196795641945
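A note on units when reading these values: sklearn's `mean_absolute_percentage_error` returns a fraction rather than a percentage, so the MAPE above corresponds to roughly 0.70%. MSE is in squared target units, and its square root (about 3.18 here) is back in the units of `co2_emissions`. The sketch below recomputes the three metrics by hand on made-up numbers. The helper `regression_metrics` is hypothetical and assumes no target value is zero.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, MAPE (as a fraction) and MSE, assuming no true value is zero."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mse = sum(e * e for e in errors) / n
    return mae, mape, mse


# Toy values (made up), not taken from the emissions dataset
toy_true = [200.0, 250.0, 300.0]
toy_pred = [198.0, 255.0, 300.0]

toy_mae, toy_mape, toy_mse = regression_metrics(toy_true, toy_pred)
toy_rmse = math.sqrt(toy_mse)  # same units as the target

print("MAE:", toy_mae)
print("MAPE as a fraction:", toy_mape)
print("MAPE as a percentage:", toy_mape * 100)
print("MSE:", toy_mse)
print("RMSE:", toy_rmse)
```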
