### Problem Statement:

Research Question: Can the performance of random grid sampling for hyperparameter optimization be enhanced by incorporating a surrogate model?

Before we can begin the practical part of the experiment, we need to ensure that our development environment is properly set up. This includes having the necessary libraries and tools installed. We'll primarily be using Python for this experiment, along with some popular libraries.

Here's a list of what you'll need:

- Python: You should have Python installed on your system. If not, you can download and install it from the official Python website (https://www.python.org/).
- Jupyter Notebook: An environment for running Python code interactively.


### Required Libraries

We will be using libraries such as pandas, scikit-learn, and XGBoost.

In [64]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
import xgboost as xgb
import random
import warnings
from xgboost import XGBRegressor
from IPython.display import display
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score



### Step 1: Data Preparation

- Load and preprocess your dataset.
- Split the dataset into training and testing sets. Typically, an 80-20 or 70-30 split is used.

In [65]:
## Loading the Data and showing the dimension or shape of the dataset

df = pd.read_csv('adverts.csv')

df.shape


(402005, 12)

Summary
The raw dataset contains 402,005 records.
It consists of 12 columns, each representing different aspects of information about vehicles.

### Checking Correct Parsing of Data

In [66]:
# Checking dataframe structure (i.e. columns and its datatypes) 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 402005 entries, 0 to 402004
Data columns (total 12 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   public_reference       402005 non-null  int64  
 1   mileage                401878 non-null  float64
 2   reg_code               370148 non-null  object 
 3   standard_colour        396627 non-null  object 
 4   standard_make          402005 non-null  object 
 5   standard_model         402005 non-null  object 
 6   vehicle_condition      402005 non-null  object 
 7   year_of_registration   368694 non-null  float64
 8   price                  402005 non-null  int64  
 9   body_type              401168 non-null  object 
 10  crossover_car_and_van  402005 non-null  bool   
 11  fuel_type              401404 non-null  object 
dtypes: bool(1), float64(2), int64(2), object(7)
memory usage: 34.1+ MB


In [67]:
# crossover_car_and_van is parsed as bool; convert it to object so it is treated as a categorical feature
df['crossover_car_and_van'] = df['crossover_car_and_van'].astype('object')
# Drop the reg_code column
df = df.drop(columns='reg_code')


In [68]:
# Review dataframe and its associated data.
df.head()


| | public_reference | mileage | standard_colour | standard_make | standard_model | vehicle_condition | year_of_registration | price | body_type | crossover_car_and_van | fuel_type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202006039777689 | 0.0 | Grey | Volvo | XC90 | NEW | NaN | 73970 | SUV | False | Petrol Plug-in Hybrid |
| 1 | 202007020778260 | 108230.0 | Blue | Jaguar | XF | USED | 2011.0 | 7000 | Saloon | False | Diesel |
| 2 | 202007020778474 | 7800.0 | Grey | SKODA | Yeti | USED | 2017.0 | 14000 | SUV | False | Petrol |
| 3 | 202007080986776 | 45000.0 | Brown | Vauxhall | Mokka | USED | 2016.0 | 7995 | Hatchback | False | Diesel |
| 4 | 202007161321269 | 64000.0 | Grey | Land Rover | Range Rover Sport | USED | 2015.0 | 26995 | SUV | False | Diesel |


To keep computational cost under control, the experiment works with a random sample of 20,000 records rather than all 402,005 raw records. The sample is drawn with a fixed `random_state` after the cleaning steps below; sampling at random, rather than taking the first rows, avoids bias from the ordering of the file.

From my observation, the data types are appropriate (i.e. numeric for quantitative features, object for qualitative features).

### COLUMN DESCRIPTION

- public_reference: An integer reference number identifying each record.
- mileage: A float indicating the mileage of the vehicle.
- reg_code: An object holding the registration code, made up of digits or letters, which acts as an age identifier.
- standard_colour: An object indicating the vehicle's colour.
- standard_make: An object representing the manufacturer of the vehicle.
- standard_model: An object describing the vehicle's model.
- vehicle_condition: An object indicating the condition of the vehicle.
- year_of_registration: A float showing the year of vehicle registration.
- price: An integer representing the price of the vehicle.
- body_type: An object describing the body type of the vehicle.
- crossover_car_and_van: A boolean indicating whether the vehicle is a crossover between a car and a van.
- fuel_type: An object indicating the type of fuel used by the vehicle.

In [69]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 402005 entries, 0 to 402004
Data columns (total 11 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   public_reference       402005 non-null  int64  
 1   mileage                401878 non-null  float64
 2   standard_colour        396627 non-null  object 
 3   standard_make          402005 non-null  object 
 4   standard_model         402005 non-null  object 
 5   vehicle_condition      402005 non-null  object 
 6   year_of_registration   368694 non-null  float64
 7   price                  402005 non-null  int64  
 8   body_type              401168 non-null  object 
 9   crossover_car_and_van  402005 non-null  object 
 10  fuel_type              401404 non-null  object 
dtypes: float64(2), int64(2), object(7)
memory usage: 33.7+ MB


In [71]:
df = df.drop(['public_reference'], axis=1)

In [72]:
def remove_outliers1(df, column_name):
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)
    IQR = Q3 - Q1
    upper_bound = Q3 + 1.5 * IQR
    # Filter the DataFrame to exclude outliers
    df_filtered = (df[df[column_name] < upper_bound]).reset_index(drop=True)
    return df_filtered

In [73]:
def remove_outliers2(df, column_name):
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    # Filter the DataFrame to exclude lower outliers
    df_filtered = (df[df[column_name] > lower_bound]).reset_index(drop=True)
    return df_filtered
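
As a quick check of the IQR rule used by the two functions above, the sketch below applies both fences to a small made-up `mileage` column. The toy values and the names `toy` and `kept` are assumptions for illustration only; they are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data: four ordinary values and one extreme value
toy = pd.DataFrame({"mileage": [10, 20, 30, 40, 1000]})

q1 = toy["mileage"].quantile(0.25)
q3 = toy["mileage"].quantile(0.75)
iqr = q3 - q1

# Same fences as remove_outliers1 (upper) and remove_outliers2 (lower)
upper_bound = q3 + 1.5 * iqr
lower_bound = q1 - 1.5 * iqr

# Keep only the rows strictly inside both fences
kept = toy[(toy["mileage"] > lower_bound) & (toy["mileage"] < upper_bound)]
print(f"fences: ({lower_bound}, {upper_bound}), rows kept: {len(kept)}")
```

Only the extreme value falls outside the fences here, so it is the only row dropped.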


In [75]:
df = df.dropna()

In [76]:
df.isna().sum()

mileage                  0
standard_colour          0
standard_make            0
standard_model           0
vehicle_condition        0
year_of_registration     0
price                    0
body_type                0
crossover_car_and_van    0
fuel_type                0
dtype: int64

In [77]:
df['vehicle_condition'].value_counts()

vehicle_condition
USED    346297
Name: count, dtype: int64

In [78]:
df = df.drop(['vehicle_condition'], axis=1)

In [74]:
df = remove_outliers1(df, 'mileage')

df = remove_outliers2(df, 'year_of_registration')

In [79]:
# Take a random subset of 20,000 rows to keep computation manageable
df = df.sample(n=20000, random_state=42)

df.shape

(20000, 9)

### Identify Quantitative and Qualitative Features

In [80]:
# Separating columns into quantitative and qualitative features
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
categorical_columns = df.select_dtypes(exclude=np.number).columns.tolist()

print("Quantitative Features:", numeric_columns)
print("Qualitative Features:", categorical_columns)

Quantitative Features: ['mileage', 'year_of_registration', 'price']
Qualitative Features: ['standard_colour', 'standard_make', 'standard_model', 'body_type', 'crossover_car_and_van', 'fuel_type']


## Data Distributions

### For Numerical Features

In [81]:
# Distribution of numerical features showing the Central Tendency and Variability
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
df[numeric_columns].describe()

| | mileage | year_of_registration | price |
|---|---|---|---|
| count | 20000.0 | 20000.0 | 20000.0 |
| mean | 37810.9568 | 2015.5754 | 15862.75925 |
| std | 29839.804993 | 3.329094 | 19236.618595 |
| min | 0.0 | 2006.0 | 350.0 |
| 25% | 14000.0 | 2014.0 | 7495.0 |
| 50% | 30062.0 | 2016.0 | 11995.0 |
| 75% | 56000.0 | 2018.0 | 18692.5 |
| max | 126445.0 | 2020.0 | 844995.0 |


In [82]:
# Obtaining numerical features from the dataset.
numeric_columns = df.select_dtypes(include=['number']).columns

# Initialize an empty DataFrame to store summary statistics
summary_stats = pd.DataFrame(columns=['Mean', 'Mode', 'Median', 'Maximum', 'Minimum', 'Standard Deviation', 'Variance', 'Range'])

# Iterate through numeric features in the DataFrame
for column in numeric_columns:
    # Calculate statistics for the currently addressed column
    column_mean = df[column].mean()
    column_mode = df[column].mode().iloc[0]
    column_median = df[column].median()
    column_max = df[column].max()
    column_min = df[column].min()
    column_stand = df[column].std()
    column_var = df[column].var()
    
    # Calculate the range
    column_range = column_max - column_min

    # Appending the statistics to the summary DataFrame
    summary_stats.loc[column] = [column_mean, column_mode, column_median, column_max, column_min, column_stand, column_var, column_range]

# To display the summary statistics DataFrame
display(summary_stats)


| | Mean | Mode | Median | Maximum | Minimum | Standard Deviation | Variance | Range |
|---|---|---|---|---|---|---|---|---|
| mileage | 37810.9568 | 10.0 | 30062.0 | 126445.0 | 0.0 | 29839.804993 | 890414000.0 | 126445.0 |
| year_of_registration | 2015.5754 | 2017.0 | 2016.0 | 2020.0 | 2006.0 | 3.329094 | 11.08287 | 14.0 |
| price | 15862.75925 | 7995.0 | 11995.0 | 844995.0 | 350.0 | 19236.618595 | 370047500.0 | 844645.0 |


In [83]:
# X is an alias of df, so the encoding below modifies df in place
X = df

# Convert string columns to ordered categoricals
for label, content in X.items():
    if pd.api.types.is_string_dtype(content):
        X[label] = content.astype('category').cat.as_ordered()

# Replace every non-numeric column with integer category codes
for label, content in X.items():
    if not pd.api.types.is_numeric_dtype(content):
        # Categorical codes use -1 for missing values; adding 1 maps missing to 0
        X[label] = pd.Categorical(content).codes + 1
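
To see what the `+ 1` in the encoding step does, the sketch below encodes a small made-up colour column that contains one missing value. The series `colours` is an assumption for illustration; in the notebook itself the missing rows were already dropped before this step.

```python
import pandas as pd

# Hypothetical toy column with one missing value
colours = pd.Series(["Grey", "Blue", None, "Grey"])

# Categories are sorted alphabetically: Blue -> 0, Grey -> 1, missing -> -1
raw_codes = pd.Categorical(colours).codes

# Shifting by one keeps every code non-negative: missing -> 0, Blue -> 1, Grey -> 2
shifted = raw_codes + 1

print(list(raw_codes))
print(list(shifted))
```

The codes follow the sorted order of the category labels, so they carry no meaning beyond identifying the category.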

In [84]:
X.head()

| | mileage | standard_colour | standard_make | standard_model | year_of_registration | price | body_type | crossover_car_and_van | fuel_type |
|---|---|---|---|---|---|---|---|---|---|
| 157531 | 32578.0 | 9 | 4 | 70 | 2017.0 | 13798 | 6 | 1 | 2 |
| 246242 | 69624.0 | 9 | 51 | 477 | 2016.0 | 5991 | 6 | 1 | 6 |
| 135298 | 35528.0 | 9 | 4 | 406 | 2015.0 | 30650 | 5 | 1 | 6 |
| 191706 | 35000.0 | 17 | 41 | 25 | 2013.0 | 4450 | 6 | 1 | 6 |
| 289450 | 44361.0 | 2 | 55 | 377 | 2012.0 | 6895 | 6 | 1 | 6 |


In [85]:
df = X.dropna()

df.head()

| | mileage | standard_colour | standard_make | standard_model | year_of_registration | price | body_type | crossover_car_and_van | fuel_type |
|---|---|---|---|---|---|---|---|---|---|
| 157531 | 32578.0 | 9 | 4 | 70 | 2017.0 | 13798 | 6 | 1 | 2 |
| 246242 | 69624.0 | 9 | 51 | 477 | 2016.0 | 5991 | 6 | 1 | 6 |
| 135298 | 35528.0 | 9 | 4 | 406 | 2015.0 | 30650 | 5 | 1 | 6 |
| 191706 | 35000.0 | 17 | 41 | 25 | 2013.0 | 4450 | 6 | 1 | 6 |
| 289450 | 44361.0 | 2 | 55 | 377 | 2012.0 | 6895 | 6 | 1 | 6 |


### The objective is to predict car prices based on the given features.

In [86]:
# Separating the independent variables from the target variable
X = df.drop(columns=["price"])
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Step 2: Initial Baseline Model

- Train and evaluate an initial baseline machine learning model using default hyperparameters (e.g., a Random Forest model).
- Record key performance metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R2).
- Measure the computation time required for training and evaluation.
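
For reference, the four metrics listed above can be computed by hand. The sketch below does so with NumPy on three made-up values; `y_true` and `y_pred` are assumptions for illustration, and the notebook itself uses the scikit-learn metric functions.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])   # hypothetical observed prices
y_pred = np.array([2.0, 5.0, 9.0])   # hypothetical predicted prices

residuals = y_true - y_pred

mse = np.mean(residuals ** 2)            # mean of squared errors
rmse = np.sqrt(mse)                      # same units as the target
mae = np.mean(np.abs(residuals))         # mean of absolute errors

# R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.4f} RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.4f}")
```

Because MSE squares each error, a few very expensive vehicles can dominate it, while MAE weights all errors equally.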

In [87]:
# X_train, X_test, y_train, and y_test come from the split above

# Define the baseline Random Forest model with default hyperparameters
baseline_model = RandomForestRegressor()

# Start time
start_time = time.time()

baseline_model.fit(X_train, y_train)

# Make predictions with the baseline model
y_baseline_pred = baseline_model.predict(X_test)

# End time
end_time = time.time()

# Calculate MSE, RMSE, MAE, and R-squared for the baseline model
mse_baseline = mean_squared_error(y_test, y_baseline_pred)
rmse_baseline = np.sqrt(mse_baseline)
mae_baseline = mean_absolute_error(y_test, y_baseline_pred)
r2_baseline = r2_score(y_test, y_baseline_pred)

# Calculate computation time
computation_time_baseline = end_time - start_time

# Print the baseline model's evaluation metrics and computation time
print(f"Mean Squared Error (Baseline): {mse_baseline}")
print(f"Root Mean Squared Error (Baseline): {rmse_baseline}")
print(f"Mean Absolute Error (Baseline): {mae_baseline}")
print(f"R-squared (Baseline): {r2_baseline}")
print(f"Computation Time (Baseline): {computation_time_baseline} seconds")


Mean Squared Error (Baseline): 109130798.26067553
Root Mean Squared Error (Baseline): 10446.568731438832
Mean Absolute Error (Baseline): 3153.7890407143755
R-squared (Baseline): 0.7168836579679748
Computation Time (Baseline): 15.821324348449707 seconds


### Step 3: Define Hyperparameter Search Space

- Define a dictionary of hyperparameters and their possible values. This is called the hyperparameter search space.
- Example: 
  ```python
  param_dist = {
      'n_estimators': [50, 100, 200],
      'max_depth': [None, 10, 20, 30],
      'min_samples_split': [2, 5, 10],
      'min_samples_leaf': [1, 2, 4],
      'bootstrap': [True, False]
  }
  ```

In [88]:
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}


The `param_dist` dictionary maps each hyperparameter name to a list of candidate values. You can adjust the values in these lists based on your specific requirements and the algorithm you are using. This search space will be used in subsequent steps for hyperparameter tuning.
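
It can help to know how large this search space is before sampling from it. The sketch below counts the combinations with scikit-learn's `ParameterGrid` and draws ten of them with `ParameterSampler`; using these two helpers is an assumption for illustration, since the notebook samples with its own loop.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Total number of distinct configurations: 3 * 4 * 3 * 3 * 2
n_combinations = len(ParameterGrid(param_dist))

# Draw 10 configurations; with list-valued parameters the draw is without replacement
sampled = list(ParameterSampler(param_dist, n_iter=10, random_state=42))

coverage = len(sampled) / n_combinations
print(f"{n_combinations} combinations, {len(sampled)} sampled ({coverage:.1%})")
```

Ten iterations therefore explore only a small fraction of the grid, which is worth keeping in mind when comparing search strategies.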

### Step 4: Hyperparameter Optimization

- Choose a hyperparameter optimization method. In this case, we'll use Random Grid Search.
- Perform Random Grid Search to find the best hyperparameters within the defined search space.
- Record the best hyperparameters and the corresponding model performance metrics.


In [89]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the hyperparameter grid for Random Grid Search
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Create a RandomForestRegressor
rf = RandomForestRegressor()

# Initialize lists to store hyperparameters and metrics
hyperparameters_list = []
mse_list = []
rmse_list = []
mae_list = []
r2_list = []

# Perform Random Grid Search and collect data
n_iterations = 10  # Number of random combinations to try

for _ in range(n_iterations):
    # Randomly sample hyperparameters
    random_params = {
        'n_estimators': np.random.choice(param_dist['n_estimators']),
        'max_depth': np.random.choice(param_dist['max_depth']),
        'min_samples_split': np.random.choice(param_dist['min_samples_split']),
        'min_samples_leaf': np.random.choice(param_dist['min_samples_leaf']),
        'bootstrap': np.random.choice(param_dist['bootstrap'])
    }

    # Train a model with the sampled hyperparameters
    rf = RandomForestRegressor(**random_params)
    rf.fit(X_train, y_train)

    # Make predictions
    y_pred = rf.predict(X_test)

    # Calculate evaluation metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Store hyperparameters and metrics in lists
    hyperparameters_list.append(random_params)
    mse_list.append(mse)
    rmse_list.append(rmse)
    mae_list.append(mae)
    r2_list.append(r2)

# Create a DataFrame to store the collected data
# (note: this rebinds df, which previously held the vehicle data; X and y are unaffected)
data = {
    'Hyperparameters': hyperparameters_list,
    'Mean Squared Error': mse_list,
    'Root Mean Squared Error': rmse_list,
    'Mean Absolute Error': mae_list,
    'R-squared': r2_list
}
df = pd.DataFrame(data)

# Find the best hyperparameters
best_hyperparameters = df.loc[df['Mean Squared Error'].idxmin()]['Hyperparameters']

# Print the best hyperparameters and their corresponding evaluation metrics
print("Best Hyperparameters:")
print(best_hyperparameters)
print("Best Mean Squared Error:", df['Mean Squared Error'].min())

# Save the DataFrame to a CSV file (optional)
df.to_csv('hyperparameter_results.csv', index=False)


Best Hyperparameters:
{'n_estimators': 100, 'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1, 'bootstrap': True}
Best Mean Squared Error: 99108512.99375771


The code above performs a random search over the hyperparameter grid for scikit-learn's RandomForestRegressor. It does not call RandomizedSearchCV; instead, each iteration draws one value per hyperparameter with `np.random.choice`, trains a model with that configuration, and scores it on the test set. The best hyperparameters and their MSE are printed at the end.

Here:

- For each iteration, the sampled hyperparameters and the corresponding evaluation metrics (MSE, RMSE, MAE, R2) are recorded in separate lists.
- After all iterations, a DataFrame (`df`) is created to store the collected data.
- The best hyperparameters are selected by lowest MSE and printed.
- The DataFrame is saved to a CSV file for further analysis.
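
For comparison, scikit-learn's built-in `RandomizedSearchCV` runs the same kind of random search but scores each configuration by cross-validation on the training data, so the test set is not used for selection. The sketch below is a minimal illustration under stated assumptions: the data is synthetic (`make_regression`), and `n_estimators`, `n_iter`, and `cv` are kept small so that it runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the vehicle data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

param_dist = {
    'n_estimators': [20, 50],          # smaller than in the notebook, for speed
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=5,                           # number of random configurations to try
    scoring='neg_mean_squared_error',   # scikit-learn maximizes, hence the negated MSE
    cv=3,
    random_state=42,
)
search.fit(X_tr, y_tr)

print("Best hyperparameters:", search.best_params_)
print("Cross-validated MSE:", -search.best_score_)
# The held-out test set is touched only once, after selection
print("Test R2:", search.best_estimator_.score(X_te, y_te))
```

After fitting, `search.cv_results_` holds the per-configuration scores, which play the same role as the lists collected in the manual loop.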

### Step 5: Updated Baseline Model

- Train and evaluate a new baseline model using the best hyperparameters obtained from the Random Grid Search.
- Record performance metrics (MSE, RMSE, MAE, R2) and computation time for this updated baseline model.

Training and evaluating a new baseline model using the best hyperparameters obtained from the Random Grid Search: 

In [90]:
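# Define and train a new baseline Random Forest model with tuned hyperparameters
# Note: these values were entered by hand; max_depth and min_samples_split differ
# from the search output printed above (max_depth=None, min_samples_split=2)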
best_hyperparameters = {
    'n_estimators': 100,  # Insert the best hyperparameters obtained from Random Grid Search
    'max_depth': 20,
    'min_samples_split': 5,
    'min_samples_leaf': 1,
    'bootstrap': True,
    'random_state': 42
}

baseline_model = RandomForestRegressor(**best_hyperparameters)

# Start time
start_time = time.time()

baseline_model.fit(X_train, y_train)

# Make predictions with the new baseline model
y_baseline_pred = baseline_model.predict(X_test)

# End time
end_time = time.time()

# Calculate evaluation metrics for the new baseline model
mse_baseline = mean_squared_error(y_test, y_baseline_pred)
rmse_baseline = np.sqrt(mse_baseline)
mae_baseline = mean_absolute_error(y_test, y_baseline_pred)
r2_baseline = r2_score(y_test, y_baseline_pred)

# Calculate computation time
computation_time_baseline = end_time - start_time

# Print the new baseline model's evaluation metrics and computation time
print(f"Mean Squared Error (Updated Baseline): {mse_baseline}")
print(f"Root Mean Squared Error (Updated Baseline): {rmse_baseline}")
print(f"Mean Absolute Error (Updated Baseline): {mae_baseline}")
print(f"R-squared (Updated Baseline): {r2_baseline}")
print(f"Computation Time (Updated Baseline): {computation_time_baseline} seconds")


Mean Squared Error (Updated Baseline): 106996527.65470083
Root Mean Squared Error (Updated Baseline): 10343.91258928172
Mean Absolute Error (Updated Baseline): 3215.3377349827197
R-squared (Updated Baseline): 0.7224205631908863
Computation Time (Updated Baseline): 11.720519304275513 seconds


Here:

- A new baseline Random Forest model was created using hyperparameters informed by the Random Grid Search.
- The model was trained on the training data.
- Predictions were made on the test data, and evaluation metrics (MSE, RMSE, MAE, R2) were calculated for the new baseline model.
- The computation time for training and prediction was measured, and the results were printed.

In short, Step 5 trains and evaluates a new baseline model with tuned hyperparameters.

The `best_hyperparameters` dictionary is meant to hold the configuration that gave the best performance during the Random Grid Search, that is, the best configuration found within the search space you defined rather than a guaranteed optimum. Note that the values hardcoded in the cell above (`max_depth=20`, `min_samples_split=5`) differ from the search output shown in Step 4 (`max_depth=None`, `min_samples_split=2`), so the cell should be updated with the values from your own run.

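The idea here is to take the best-performing hyperparameters obtained from the hyperparameter optimization step (Random Grid Search) and use them to train an updated baseline model. In principle, this model should perform better than the initial baseline with default hyperparameters. In this run the improvement is small: MSE falls from about 109.1 million to about 107.0 million and R2 rises from 0.717 to 0.722, while MAE rises slightly from about 3,154 to about 3,215. Because the initial baseline was trained without a fixed `random_state`, part of this difference may be run-to-run variation.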

In other words, we are leveraging the knowledge gained from the hyperparameter optimization step to build a better-performing baseline model. This helps establish a benchmark for evaluating whether more advanced techniques, such as surrogate modeling, can further improve model performance.
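
The comparison between the two baselines can be made concrete by computing the relative change in each metric. The sketch below uses the values copied from the printed outputs of the initial and updated baseline cells; the dictionary names `initial`, `updated`, and `changes` are assumptions for illustration.

```python
# Metrics copied from the printed outputs of the two baseline cells
initial = {"MSE": 109130798.26067553, "MAE": 3153.7890407143755, "R2": 0.7168836579679748}
updated = {"MSE": 106996527.65470083, "MAE": 3215.3377349827197, "R2": 0.7224205631908863}

# Relative change in percent; negative is better for MSE and MAE, positive is better for R2
changes = {
    name: (updated[name] - initial[name]) / initial[name] * 100
    for name in initial
}

for name, change in changes.items():
    print(f"{name}: {initial[name]:.6g} -> {updated[name]:.6g} ({change:+.2f}%)")
```

Reporting the change per metric makes it visible when metrics disagree, as MSE and MAE do in this run.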

### Step 6: Surrogate Model Training (Optional)

- Choose a surrogate model (here, XGBoost) for the next part of the experiment.
- Train the surrogate model using the hyperparameters and their corresponding performance metrics from the Random Grid Search.

The cell below trains an XGBoost model as the surrogate. As written, it is fitted to the vehicle features (`X_train`, `y_train`) rather than to the hyperparameter configurations and metrics collected during the Random Grid Search.

In [91]:
# Define the hyperparameters used in surrogate model training
surrogate_hyperparameters = {
    'objective': 'reg:squarederror',
    'n_estimators': 100,
    'max_depth': 5,
    'learning_rate': 0.1
}

# Create an XGBoost regressor with these hyperparameters
xgb_model = XGBRegressor(**surrogate_hyperparameters)

# Start time
start_time = time.time()

# Train the XGBoost model on the training data
xgb_model.fit(X_train, y_train)

# Make predictions using the XGBoost model
y_xgb_pred = xgb_model.predict(X_test)

# End time
end_time = time.time()

# Calculate evaluation metrics for the surrogate model
mse_xgb = mean_squared_error(y_test, y_xgb_pred)
rmse_xgb = np.sqrt(mse_xgb)
mae_xgb = mean_absolute_error(y_test, y_xgb_pred)
r2_xgb = r2_score(y_test, y_xgb_pred)

# Calculate computation time
computation_time_xgb = end_time - start_time

# Print the surrogate model's evaluation metrics and computation time
print(f"Mean Squared Error (XGBoost): {mse_xgb}")
print(f"Root Mean Squared Error (XGBoost): {rmse_xgb}")
print(f"Mean Absolute Error (XGBoost): {mae_xgb}")
print(f"R-squared (XGBoost): {r2_xgb}")
print(f"Computation Time (XGBoost): {computation_time_xgb} seconds")


Mean Squared Error (XGBoost): 100597299.67831352
Root Mean Squared Error (XGBoost): 10029.820520742807
Mean Absolute Error (XGBoost): 3778.5317840967177
R-squared (XGBoost): 0.7390219813549521
Computation Time (XGBoost): 1.0200319290161133 seconds


Here:

- `surrogate_hyperparameters` defines the settings of the XGBoost model used as the surrogate.
- An XGBoost regressor was created using these hyperparameters.
- The XGBoost model was trained on the training data.
- Predictions were made on the test data, and the model was evaluated using MSE, RMSE, MAE, R2, and computation time.
- The same hyperparameters are used for surrogate training and evaluation.

In short, the code above sets up an XGBoost regressor, trains it on the training data, makes predictions, and calculates MSE along with the other metrics.
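
The step description calls for a surrogate trained on hyperparameters and their performance metrics. A minimal sketch of that idea follows, under several assumptions: the data is synthetic, the search space is reduced for speed, the helper `encode` is hypothetical, and scikit-learn's `GradientBoostingRegressor` stands in for XGBoost to limit dependencies (an `XGBRegressor` could be used in the same place).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterSampler, train_test_split

# Synthetic stand-in for the vehicle data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

search_space = {
    'n_estimators': [10, 20, 40],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

def encode(config):
    """Hypothetical helper: turn a configuration into a numeric row."""
    return [
        config['n_estimators'],
        config['max_depth'] or 0,        # None (unlimited depth) is encoded as 0
        config['min_samples_split'],
        config['min_samples_leaf'],
        int(config['bootstrap']),
    ]

# 1. Evaluate a handful of random configurations with the real model
history = list(ParameterSampler(search_space, n_iter=12, random_state=0))
scores = []
for config in history:
    model = RandomForestRegressor(**config, random_state=0).fit(X_tr, y_tr)
    scores.append(mean_squared_error(y_val, model.predict(X_val)))

# 2. Fit the surrogate: inputs are configurations, target is the observed MSE
surrogate = GradientBoostingRegressor(random_state=0)
surrogate.fit(np.array([encode(c) for c in history]), np.array(scores))

# 3. Ask the surrogate for the predicted MSE of a configuration without training it
candidate = {'n_estimators': 40, 'max_depth': None, 'min_samples_split': 2,
             'min_samples_leaf': 1, 'bootstrap': True}
predicted_mse = surrogate.predict(np.array([encode(candidate)]))[0]
print(f"Predicted MSE for candidate: {predicted_mse:.1f}")
```

The surrogate's input is the hyperparameter vector, not the vehicle features, which is what makes its predictions cheap relative to training the Random Forest itself.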

### Step 7: Surrogate-Guided Hyperparameter Optimization (Optional)

- Using the surrogate model, predict the performance of different hyperparameter configurations sampled randomly.
- Select the hyperparameter configuration with the best predicted performance as the next candidate.
- Repeat this process for a certain number of iterations or until a stopping criterion is met.
- Record the best hyperparameters and their corresponding performance metrics.

The cell below performs surrogate-guided hyperparameter optimization using the XGBoost surrogate:

In [92]:

# Ignore XGBoost warnings
warnings.filterwarnings("ignore", category=UserWarning)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the search space for hyperparameters
search_space = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Number of random hyperparameters to sample
num_samples = 100  # You can adjust this number

# Initialize a list to store the sampled hyperparameters
sampled_hyperparameters = []

# Getting a list of randomly sampled hyperparameters.
for _ in range(num_samples):
    random_hyperparams = {
        'n_estimators': random.choice(search_space['n_estimators']),
        'max_depth': random.choice(search_space['max_depth']),
        'min_samples_split': random.choice(search_space['min_samples_split']),
        'min_samples_leaf': random.choice(search_space['min_samples_leaf']),
        'bootstrap': random.choice(search_space['bootstrap'])
    }
    sampled_hyperparameters.append(random_hyperparams)

# Create empty lists to store the performance metrics
mse_scores = []
rmse_scores = []
mae_scores = []
r2_scores = []

# Create an empty list to store the predicted scores
predicted_scores = []

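# Train and score one XGBoost model per sampled configuration
for hyperparameters in sampled_hyperparameters:
    # Note: min_samples_split, min_samples_leaf, and bootstrap are Random Forest
    # parameters, not XGBoost parameters, so only n_estimators and max_depth
    # influence the fitted model here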
    xgb_params = {
        'objective': 'reg:squarederror',
        'n_estimators': hyperparameters['n_estimators'],
        'max_depth': hyperparameters['max_depth'],
        'min_samples_split': hyperparameters['min_samples_split'],
        'min_samples_leaf': hyperparameters['min_samples_leaf'],
        'bootstrap': hyperparameters['bootstrap']
    }
    
    xgb_model = xgb.XGBRegressor(**xgb_params)  # Create XGBoost regressor
    
    # Start time for computation time measurement
    start_time = time.time()
    
    # Train XGBoost model with the same training data used for the surrogate model
    xgb_model.fit(X_train, y_train)
    
    # Make predictions with the surrogate model (XGBoost)
    y_xgb_pred = xgb_model.predict(X_test)
    
    # Calculate performance metrics
    mse_xgb = mean_squared_error(y_test, y_xgb_pred)
    rmse_xgb = np.sqrt(mse_xgb)
    mae_xgb = mean_absolute_error(y_test, y_xgb_pred)
    r2_xgb = r2_score(y_test, y_xgb_pred)
    
    # End time for computation time measurement
    end_time = time.time()
    
    # Calculate computation time
    computation_time_xgb = end_time - start_time
    
    # Append metrics to lists
    mse_scores.append(mse_xgb)
    rmse_scores.append(rmse_xgb)
    mae_scores.append(mae_xgb)
    r2_scores.append(r2_xgb)

    # Record the measured test MSE as this configuration's score
    predicted_scores.append(mse_xgb)

# Find the hyperparameters with the best score (lowest MSE)
best_hyperparameters_idx = np.argmin(predicted_scores)
best_hyperparameters = sampled_hyperparameters[best_hyperparameters_idx]

# Print the best hyperparameters and performance metrics
print("Best Hyperparameters:", best_hyperparameters)
print(f"Mean Squared Error (Best Model): {mse_scores[best_hyperparameters_idx]}")
print(f"Root Mean Squared Error (Best Model): {rmse_scores[best_hyperparameters_idx]}")
print(f"Mean Absolute Error (Best Model): {mae_scores[best_hyperparameters_idx]}")
print(f"R-squared (Best Model): {r2_scores[best_hyperparameters_idx]}")
# computation_time_xgb holds the time of the last iteration only
print(f"Computation Time (XGBoost): {computation_time_xgb} seconds")


Best Hyperparameters: {'n_estimators': 200, 'max_depth': None, 'min_samples_split': 5, 'min_samples_leaf': 2, 'bootstrap': True}
Mean Squared Error (Best Model): 84575298.35262437
Root Mean Squared Error (Best Model): 9196.482933851636
Mean Absolute Error (Best Model): 2846.664996134758
R-squared (Best Model): 0.7805876115863577
Computation Time (XGBoost): 1.502157211303711 seconds
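
The loop described in the bullets at the start of this step (predict with the surrogate, pick the best predicted candidate, evaluate it, repeat) can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's method: the data is synthetic, the helpers `encode` and `evaluate` are hypothetical, the search space is reduced for speed, and `GradientBoostingRegressor` stands in for XGBoost as the surrogate.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterSampler, train_test_split

# Synthetic stand-in for the vehicle data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

search_space = {
    'n_estimators': [10, 20, 40],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

def encode(config):
    """Hypothetical helper: numeric row for the surrogate (None depth -> 0)."""
    return [config['n_estimators'], config['max_depth'] or 0,
            config['min_samples_split'], config['min_samples_leaf'],
            int(config['bootstrap'])]

def evaluate(config):
    """Hypothetical helper: train the real model and return its validation MSE."""
    model = RandomForestRegressor(**config, random_state=1).fit(X_tr, y_tr)
    return mean_squared_error(y_val, model.predict(X_val))

# Start from a few randomly chosen, fully evaluated configurations
configs = list(ParameterSampler(search_space, n_iter=5, random_state=1))
scores = [evaluate(c) for c in configs]

for step in range(5):
    # Refit the surrogate on every (configuration, MSE) pair seen so far
    surrogate = GradientBoostingRegressor(random_state=1)
    surrogate.fit(np.array([encode(c) for c in configs]), np.array(scores))

    # Score many untried candidates cheaply with the surrogate
    candidates = [c for c in ParameterSampler(search_space, n_iter=50, random_state=100 + step)
                  if c not in configs]
    predicted = surrogate.predict(np.array([encode(c) for c in candidates]))

    # Train the real model only for the candidate with the lowest predicted MSE
    best_candidate = candidates[int(np.argmin(predicted))]
    configs.append(best_candidate)
    scores.append(evaluate(best_candidate))

best_index = int(np.argmin(scores))
print("Best hyperparameters:", configs[best_index])
print("Best validation MSE:", scores[best_index])
```

In this arrangement the expensive model is trained ten times in total, while the surrogate screens fifty candidates per step; always taking the lowest prediction is a purely greedy choice, and other selection rules are possible.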


### Step 8: Final Evaluation

- Train and evaluate a model using the best hyperparameters obtained from the Surrogate-Guided Optimization.
- Record performance metrics (MSE, RMSE, MAE, R2) and computation time for this final model.

Perform the final evaluation using the best hyperparameters obtained from either the Random Grid Search or the Surrogate-Guided Optimization.

In [93]:
# Define the best hyperparameters (replace with your actual best hyperparameters)
best_hyperparameters = {
    'n_estimators': 100,
    'max_depth': None,
    'min_samples_split': 5,
    'min_samples_leaf': 1,
    'bootstrap': False
}

# Create a RandomForestRegressor with the best hyperparameters
final_model = RandomForestRegressor(**best_hyperparameters, random_state=42)

# Start time
start_time = time.time()

# Train the final model on the training data
final_model.fit(X_train, y_train)

# Make predictions with the final model
y_final_pred = final_model.predict(X_test)

# End time
end_time = time.time()

# Calculate evaluation metrics for the final model
mse_final = mean_squared_error(y_test, y_final_pred)
rmse_final = np.sqrt(mse_final)
mae_final = mean_absolute_error(y_test, y_final_pred)
r2_final = r2_score(y_test, y_final_pred)

# Calculate computation time
computation_time_final = end_time - start_time

# Print the final model's evaluation metrics and computation time
print(f"Mean Squared Error (Final Model): {mse_final}")
print(f"Root Mean Squared Error (Final Model): {rmse_final}")
print(f"Mean Absolute Error (Final Model): {mae_final}")
print(f"R-squared (Final Model): {r2_final}")
print(f"Computation Time (Final Model): {computation_time_final} seconds")


Mean Squared Error (Final Model): 209865964.36988282
Root Mean Squared Error (Final Model): 14486.751339409497
Mean Absolute Error (Final Model): 3938.6594433333335
R-squared (Final Model): 0.45554797457360086
Computation Time (Final Model): 15.165497541427612 seconds


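Replace the `best_hyperparameters` dictionary with the best hyperparameters obtained from your own run of either the Random Grid Search or the Surrogate-Guided Optimization. The values hardcoded above differ from the Step 7 output (`n_estimators=200`, `min_samples_leaf=2`, `bootstrap=True`), and in this run the final model scores lower (R2 of about 0.456) than the updated baseline (R2 of about 0.722). The code trains the final model with the given hyperparameters and evaluates it on the test data, recording the performance metrics and computation time.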
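
One way to avoid retyping values by hand is to read the best configuration from the results table and pass it straight to the model. The sketch below uses a made-up results table in the same layout as the Step 4 DataFrame; the table name `results`, its rows, and its MSE values are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical search log with the same columns as the Step 4 results (made-up values)
results = pd.DataFrame({
    'Hyperparameters': [
        {'n_estimators': 50, 'max_depth': 10, 'bootstrap': True},
        {'n_estimators': 200, 'max_depth': None, 'bootstrap': True},
        {'n_estimators': 100, 'max_depth': 20, 'bootstrap': False},
    ],
    'Mean Squared Error': [1.30e8, 0.99e8, 1.75e8],
})

# Select the row with the lowest MSE and take its hyperparameter dictionary
best_row = results['Mean Squared Error'].idxmin()
best_hyperparameters = results.loc[best_row, 'Hyperparameters']

# Unpack the dictionary directly, so the final model matches the search result
final_model = RandomForestRegressor(**best_hyperparameters, random_state=42)
print(final_model.get_params()['n_estimators'], final_model.get_params()['max_depth'])
```

Passing the dictionary along in code keeps the final model tied to the configuration the search actually selected.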