# Optuna Experiments

This notebook contains the Optuna hyperparameter-tuning experiments for both the classification and the regression dataset.

In [7]:
# Import required modules
import optuna
import time
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error, accuracy_score, precision_score, recall_score

In [8]:
# Set random seed
RANDOM_SEED = 3

In [9]:
# Helper that prints the elapsed time between two time.time() readings
def print_elapsed_time(start, end):
    elapsed_time = end - start
    minutes = int(elapsed_time // 60)
    seconds = int(elapsed_time % 60)
    print("Elapsed time: {} minutes, {} seconds".format(minutes, seconds))

## Hospital Readmissions (Classification)

In this section, we run Optuna on our classification dataset. In Optuna, we define an objective function that returns the value to maximize or minimize (here, the accuracy score on the test split) and then create a study that runs a fixed number of trials (here, 100). In each trial, Optuna suggests a new set of hyperparameters and records the resulting objective value.

In [10]:
# Read in data
readmissions = pd.read_csv('../data/classification/readmissions_clean.csv')

# Split dataset into X and Y
X = readmissions.drop(['readmitted'], axis=1)
y = readmissions.readmitted

# splitting X and Y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=RANDOM_SEED, test_size=0.2)

In [13]:
# Define an objective to maximize or minimize (here, we maximize accuracy)
def objective(trial):
    # Search the same parameter ranges that the grid search covered
    # (step is passed by keyword, which works across Optuna versions)
    n_estimators = trial.suggest_int('n_estimators', 50, 300, step=1)
    max_depth = trial.suggest_int('max_depth', 5, 15, step=1)
    max_features = trial.suggest_int('max_features', 3, 10, step=1)

    # Build and fit the random forest classifier
    rf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, max_features=max_features, random_state=RANDOM_SEED)
    rf.fit(X_train, y_train)

    # Predict on the test split and score the predictions
    pred = rf.predict(X_test)
    score = accuracy_score(y_test, pred)

    return score

# Run and time optimization
start = time.time()                                
study = optuna.create_study(
    direction='maximize',
    # storage and study name are used to generate the dashboard
    storage="sqlite:///db.sqlite3",  
    study_name="hospital-readmissions") 
study.optimize(objective, n_trials=100)
end = time.time()

[I 2023-04-26 13:53:06,482] A new study created in RDB with name: hospital-readmissions
[I 2023-04-26 13:53:11,042] Trial 0 finished with value: 0.618 and parameters: {'n_estimators': 296, 'max_depth': 15, 'max_features': 5}. Best is trial 0 with value: 0.618.
[I 2023-04-26 13:53:12,551] Trial 1 finished with value: 0.6212 and parameters: {'n_estimators': 91, 'max_depth': 14, 'max_features': 6}. Best is trial 1 with value: 0.6212.
[I 2023-04-26 13:53:15,370] Trial 2 finished with value: 0.6284 and parameters: {'n_estimators': 186, 'max_depth': 11, 'max_features': 7}. Best is trial 2 with value: 0.6284.
[I 2023-04-26 13:53:17,114] Trial 3 finished with value: 0.6236 and parameters: {'n_estimators': 259, 'max_depth': 5, 'max_features': 5}. Best is trial 2 with value: 0.6284.
[I 2023-04-26 13:53:18,579] Trial 4 finished with value: 0.612 and parameters: {'n_estimators': 75, 'max_depth': 15, 'max_features': 7}. Best 

We see that the Optuna optimization took the following time to run:

In [14]:
# Display time elapsed
print_elapsed_time(start,end)

Elapsed time: 3 minutes, 25 seconds


We can also view the best trial that Optuna found across the 100 trials:

In [15]:
# Display results of best trial
study.best_trial

FrozenTrial(number=64, state=TrialState.COMPLETE, values=[0.63], datetime_start=datetime.datetime(2023, 4, 26, 13, 55, 22, 491190), datetime_complete=datetime.datetime(2023, 4, 26, 13, 55, 24, 378638), params={'max_depth': 8, 'max_features': 5, 'n_estimators': 206}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'max_depth': IntDistribution(high=15, log=False, low=5, step=1), 'max_features': IntDistribution(high=10, log=False, low=3, step=1), 'n_estimators': IntDistribution(high=300, log=False, low=50, step=1)}, trial_id=65, value=None)

In [16]:
study.best_params

{'max_depth': 8, 'max_features': 5, 'n_estimators': 206}

To open this study in the Optuna dashboard locally, follow these steps:

1. Install `optuna-dashboard` with pip
2. In a terminal, change the working directory to the one that contains the experiment files and `db.sqlite3` (`/experiment`)
3. Run `optuna-dashboard sqlite:///db.sqlite3`
4. Open the web address that the command prints to view the dashboard

Next, we refit the model with the best set of parameters and compute the final metric scores for Optuna.

In [17]:
best_params = study.best_params

# Re-fit classifier with optimal parameters
rf = RandomForestClassifier(
    max_depth=best_params["max_depth"],
    max_features=best_params["max_features"],
    n_estimators=best_params["n_estimators"],
    random_state=RANDOM_SEED,
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Calculate and print metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))

Accuracy: 0.6288
Precision: 0.6320541760722348
Recall: 0.4819277108433735
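
To make the three scores concrete, here is a minimal pure-Python sketch of how they follow from the confusion counts. It assumes binary labels encoded as 0/1 with 1 as the positive class, which is what the default arguments of `precision_score` and `recall_score` expect; the helper name and the toy labels are illustrative and are not taken from the readmissions data.

```python
def binary_classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Toy labels (hypothetical): 2 true positives, 1 false positive,
# 2 false negatives, 3 true negatives.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 0, 0]

metrics = binary_classification_metrics(toy_true, toy_pred)
print(metrics)
```

Read this way, the recall of about 0.48 reported above means the tuned model flags roughly half of the patients who were actually readmitted, while about 63% of its positive predictions are correct.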


## Car Emissions Data (Regression)

For regression, we follow the same steps but optimize a regression metric, mean squared error (MSE). Lower MSE is better, so instead of maximizing the objective as we did for classification, we minimize it.
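
As a small worked example of why MSE is minimized, the sketch below computes it by hand for two sets of predictions. The helper name and the numbers are made up for illustration; the experiments themselves use `mean_squared_error` from scikit-learn.

```python
def mse_by_hand(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical target values and two candidate sets of predictions.
targets = [250.0, 180.0, 310.0]
close_preds = [252.0, 179.0, 307.0]   # errors of 2, 1 and 3
far_preds = [240.0, 200.0, 330.0]     # errors of 10, 20 and 20

mse_close = mse_by_hand(targets, close_preds)  # (4 + 1 + 9) / 3
mse_far = mse_by_hand(targets, far_preds)      # (100 + 400 + 400) / 3

print("MSE, close predictions:", mse_close)
print("MSE, far predictions:", mse_far)
```

Because the errors are squared, the predictions that sit further from the targets are penalized far more heavily, and a study created with `direction='minimize'` prefers the parameter set that produces the smaller value.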

In [18]:
# Read in data
emissions = pd.read_csv("../data/regression/emissions_cleaned.csv")

# Split dataset into X and Y
X = emissions.drop('co2_emissions', axis=1)
y = emissions["co2_emissions"]

# Split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RANDOM_SEED, test_size=0.2)

In [19]:
# Define an objective to maximize or minimize (here, we minimize MSE)

def objective(trial):
    # Search the same parameter ranges that the grid search covered
    # (step is passed by keyword, which works across Optuna versions)
    n_estimators = trial.suggest_int('n_estimators', 50, 300, step=1)
    max_depth = trial.suggest_int('max_depth', 5, 15, step=1)
    max_features = trial.suggest_int('max_features', 3, 10, step=1)

    # Build and fit the random forest regressor
    rf = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, max_features=max_features, random_state=RANDOM_SEED)
    rf.fit(X_train, y_train)

    # Predict on the test split and score the predictions
    pred = rf.predict(X_test)
    score = mean_squared_error(y_test, pred)

    return score

# Run and time optimization
start = time.time()                              
study = optuna.create_study(
    direction='minimize',
    # Also add this study to the dashboard
    storage="sqlite:///db.sqlite3",  
    study_name="co2-emissions")
study.optimize(objective, n_trials=100)
end = time.time()

[I 2023-04-26 14:01:29,876] A new study created in RDB with name: co2-emissions
[I 2023-04-26 14:01:30,315] Trial 0 finished with value: 11.456047861947715 and parameters: {'n_estimators': 88, 'max_depth': 14, 'max_features': 4}. Best is trial 0 with value: 11.456047861947715.
[I 2023-04-26 14:01:30,718] Trial 1 finished with value: 35.07473945082315 and parameters: {'n_estimators': 140, 'max_depth': 6, 'max_features': 6}. Best is trial 0 with value: 11.456047861947715.
[I 2023-04-26 14:01:32,095] Trial 2 finished with value: 11.056369719828819 and parameters: {'n_estimators': 293, 'max_depth': 13, 'max_features': 6}. Best is trial 2 with value: 11.056369719828819.
[I 2023-04-26 14:01:32,429] Trial 3 finished with value: 42.69933085390503 and parameters: {'n_estimators': 106, 'max_depth': 5, 'max_features': 10}. Best is trial 2 with value: 11.056369719828819.
[I 2023-04-26 14:01:33,839] Trial 4 finished with valu

We see that the Optuna optimization took the following time to execute:

In [20]:
# Display time elapsed
print_elapsed_time(start,end)

Elapsed time: 1 minutes, 5 seconds


We can also view the best trial that Optuna found across the 100 trials:

In [21]:
# Display results of best trial
study.best_trial

FrozenTrial(number=93, state=TrialState.COMPLETE, values=[9.664922253234332], datetime_start=datetime.datetime(2023, 4, 26, 14, 2, 32, 111779), datetime_complete=datetime.datetime(2023, 4, 26, 14, 2, 32, 563917), params={'max_depth': 15, 'max_features': 10, 'n_estimators': 60}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'max_depth': IntDistribution(high=15, log=False, low=5, step=1), 'max_features': IntDistribution(high=10, log=False, low=3, step=1), 'n_estimators': IntDistribution(high=300, log=False, low=50, step=1)}, trial_id=194, value=None)

In [22]:
study.best_params

{'max_depth': 15, 'max_features': 10, 'n_estimators': 60}

Finally, we refit the model with this optimized set of parameters and compute the final metric scores for Optuna.

In [23]:
best_params = study.best_params

# Re-fit regressor with optimal parameters
rf = RandomForestRegressor(
    max_depth=best_params["max_depth"],
    max_features=best_params["max_features"],
    n_estimators=best_params["n_estimators"],
    random_state=RANDOM_SEED,
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Calculate and print metrics
print('Mean Absolute Error (MAE):', mean_absolute_error(y_test, y_pred))
print('Mean Absolute Percentage Error (MAPE):', mean_absolute_percentage_error(y_test, y_pred))
print('Mean Squared Error (MSE):', mean_squared_error(y_test, y_pred))

Mean Absolute Error (MAE): 1.691188050128803
Mean Absolute Percentage Error (MAPE): 0.007015950042615298
Mean Squared Error (MSE): 9.664922253234332
