1. Dataset

  The dataset you will be using for this lab session: https://www.kaggle.com/datasets/muhammadbinimran/housing-price-prediction-data/code

2. Splitting the Data:

  Experiment with two different data splits:
  * 80-20 split
  * 70-30 split

3. Multiple Linear Regression and Random Forest:


4. Benchmarking:

  Compare and benchmark the performance of the two regression models. Document your observations.


6. **Recall**: Creating Python functions to streamline your Machine Learning process (code reusability)

# How do you make functions in Python?

- Use the `def` keyword, followed by the name of your function.
- List the function's parameters (inputs) inside the parentheses `()`.
```
def add_numbers(a, b):
    return a+b
```
```
# to call this function:
add_numbers(5, 7) # expected output should be 12
```

### More examples:
```
def greet(name):
    return "Hello, " + name
```

# Import the libraries

In [None]:
import pandas as pd
import numpy as np
import sklearn.preprocessing as preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Loading Data

In [None]:
# Function to load data
def load_data(filepath):
    df = pd.read_csv(filepath)
    return df

# Call the load_data function
df = load_data('housing_price_dataset.csv')

# Display the first few rows of the dataframe to confirm it loaded correctly
print(df.head())


   SquareFeet  Bedrooms  Bathrooms Neighborhood  YearBuilt          Price
0        2126         4          1        Rural       1969  215355.283618
1        2459         3          2        Rural       1980  195014.221626
2        1860         2          1       Suburb       1970  306891.012076
3        2294         2          1        Urban       1996  206786.787153
4        2130         5          2       Suburb       2001  272436.239065


I chose to load a local copy rather than download from Kaggle directly, since hosted datasets can change or be removed.

# Preprocessing

In [None]:
# Function for preprocessing data
def preprocess_data(df):
    # Copy the first 10,000 rows so later assignments modify an independent
    # DataFrame and do not trigger pandas' SettingWithCopyWarning
    df = df.iloc[:10000, :].copy()
    # Replace negative prices with their absolute value
    df['Price'] = df['Price'].abs()
    # Encode the Neighborhood strings as integers (Rural=0, Suburb=1, Urban=2)
    df['Neighborhood'] = preprocessing.LabelEncoder().fit_transform(df['Neighborhood'])
    return df

# Call the preprocess_data function
df = preprocess_data(df)

# Display the first few rows of the preprocessed dataframe
print(df.head())

   SquareFeet  Bedrooms  Bathrooms  Neighborhood  YearBuilt          Price
0        2126         4          1             0       1969  215355.283618
1        2459         3          2             0       1980  195014.221626
2        1860         2          1             1       1970  306891.012076
3        2294         2          1             2       1996  206786.787153
4        2130         5          2             1       2001  272436.239065


Take the absolute value of 'Price' to get rid of negative prices, keep only the first 10,000 rows to speed up tuning, and label-encode 'Neighborhood' into integers.
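
Label encoding turns 'Neighborhood' into the integers 0, 1, 2, which a linear model may read as an ordered quantity even though the categories have no natural order. A minimal sketch of one-hot encoding as an alternative, assuming pandas' `get_dummies`; the `toy` frame below is made up for illustration and is not the lab dataset:

```python
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "SquareFeet": [2126, 2459, 1860, 2294],
    "Neighborhood": ["Rural", "Rural", "Suburb", "Urban"],
})

# One 0/1 column per category instead of a single ordinal integer
encoded = pd.get_dummies(toy, columns=["Neighborhood"], dtype=int)
print(encoded.columns.tolist())
```

Whether this helps on the real data is something to check empirically; tree-based models such as Random Forest are usually less sensitive to the choice of encoding than linear models.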

# Feature Selection

In [None]:
# Define features and target variable
X = df.drop('Price', axis=1)  # Features
y = df['Price']               # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Function for feature selection
def select_features(X_train, y_train, X_test, k=3):
    selector = SelectKBest(score_func=f_regression, k=k)
    X_train = selector.fit_transform(X_train, y_train)
    X_test = selector.transform(X_test)
    return X_train, X_test

# Call select_features function
# (note: the models further down are still trained on all features in X)
X_train_selected, X_test_selected = select_features(X_train, y_train, X_test, k=3)
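
`select_features` returns plain arrays, so the names of the columns it kept are lost. A sketch of how the kept columns could be recovered with `SelectKBest.get_support()`, using synthetic data; the column names `size`, `rooms`, `noise` and the formula for `y_demo` are invented for this illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "size": rng.normal(size=200),
    "rooms": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
# Target depends strongly on "size", less on "rooms", and not on "noise"
y_demo = 5.0 * X_demo["size"] + 2.0 * X_demo["rooms"] + rng.normal(scale=0.1, size=200)

# Fit the selector, then map its boolean mask back onto the column names
selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
kept = X_demo.columns[selector.get_support()].tolist()
print("Kept features:", kept)
```

The same two lines (`get_support()` plus column indexing) should work on the lab's `X_train` as long as it is still a DataFrame when the selector is fitted.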



# Multiple Linear Regression and Random Forest Regression

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Function to SPLIT data
def split_data(X, y, test_sizes: list, random_state=None):
    splits = {}
    for test_size in test_sizes:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state
        )
        splits[test_size] = {
            "X_train": X_train,
            "X_test": X_test,
            "y_train": y_train,
            "y_test": y_test,
        }
    return splits

# Generalized function to TRAIN and PREDICT with any model
def train_and_predict(model, X_train, y_train, X_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# Function to calculate EVALUATION metrics
def calculate_metrics(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return mae, mse, rmse, r2, mape

# Function to EVALUATE multiple models with different splits
def evaluate_models(X, y, test_sizes, models, random_state=None):
    data_splits = split_data(X, y, test_sizes, random_state)

    all_results = {}
    for model_name, model in models.items():
        results = {}
        for test_size, split in data_splits.items():
            # Train and predict for the TRAINING and TEST sets (the model is refit on each call)
            y_train_pred = train_and_predict(model, split['X_train'], split['y_train'], split['X_train'])
            y_test_pred = train_and_predict(model, split['X_train'], split['y_train'], split['X_test'])

            # Calculate metrics for TRAINING set
            train_mae, train_mse, train_rmse, train_r2, train_mape = calculate_metrics(split['y_train'], y_train_pred)
            # Calculate metrics for TEST set
            test_mae, test_mse, test_rmse, test_r2, test_mape = calculate_metrics(split['y_test'], y_test_pred)

            results[test_size] = {
                'Train': {
                    'MAE': train_mae,
                    'MSE': train_mse,
                    'RMSE': train_rmse,
                    'R^2': train_r2,
                    'MAPE': train_mape
                },
                'Test': {
                    'MAE': test_mae,
                    'MSE': test_mse,
                    'RMSE': test_rmse,
                    'R^2': test_r2,
                    'MAPE': test_mape
                }
            }
        all_results[model_name] = results
    return all_results

# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest Regression': RandomForestRegressor(n_estimators=100, random_state=42)
}

# Usage
test_sizes = [0.2, 0.3]
results = evaluate_models(X, y, test_sizes, models, random_state=42)

# Display results for each model and test size
for model_name, res in results.items():
    print()
    print(f"Evaluation for {model_name}:")
    for test_size, metrics in res.items():
        print(f"\nTest Size: {test_size}")
        print(f"Train - MAE: {metrics['Train']['MAE']:.4f}, MSE: {metrics['Train']['MSE']:.4f}, RMSE: {metrics['Train']['RMSE']:.4f}, R^2: {metrics['Train']['R^2']:.4f}, MAPE: {metrics['Train']['MAPE']:.2f}%")
        print(f"Test  - MAE: {metrics['Test']['MAE']:.4f}, MSE: {metrics['Test']['MSE']:.4f}, RMSE: {metrics['Test']['RMSE']:.4f}, R^2: {metrics['Test']['R^2']:.4f}, MAPE: {metrics['Test']['MAPE']:.2f}%")



Evaluation for Linear Regression:

Test Size: 0.2
Train - MAE: 40356.5020, MSE: 2549096747.4459, RMSE: 50488.5804, R^2: 0.5551, MAPE: 24.51%
Test  - MAE: 40765.4492, MSE: 2564777435.9967, RMSE: 50643.6317, R^2: 0.5732, MAPE: 27.40%

Test Size: 0.3
Train - MAE: 40427.5430, MSE: 2552133934.3296, RMSE: 50518.6494, R^2: 0.5548, MAPE: 24.42%
Test  - MAE: 40475.3061, MSE: 2553086568.9581, RMSE: 50528.0770, R^2: 0.5680, MAPE: 26.79%

Evaluation for Random Forest Regression:

Test Size: 0.2
Train - MAE: 15843.0191, MSE: 403223965.3826, RMSE: 20080.4374, R^2: 0.9296, MAPE: 9.49%
Test  - MAE: 43045.7923, MSE: 2887255915.9704, RMSE: 53733.1919, R^2: 0.5195, MAPE: 28.83%

Test Size: 0.3
Train - MAE: 15959.9025, MSE: 409183953.9074, RMSE: 20228.2959, R^2: 0.9286, MAPE: 9.53%
Test  - MAE: 42995.9320, MSE: 2909144390.0158, RMSE: 53936.4848, R^2: 0.5078, MAPE: 28.22%
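
For reference, these are the standard definitions of the metrics that `calculate_metrics` reports, with $y_i$ the true price, $\hat{y}_i$ the predicted price, $\bar{y}$ the mean of the true prices, and $n$ the number of samples:

$$
\begin{aligned}
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \\
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \\
\mathrm{MAPE} &= \frac{100}{n}\sum_{i=1}^{n} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert
\end{aligned}
$$

MAE and RMSE are in the same units as 'Price', MSE is in squared units, and MAPE is undefined whenever a true value $y_i$ is zero.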


**Linear Regression:**

MAE, MSE, and RMSE are only slightly higher on the test set than on the training set, and MAPE rises by roughly 2 to 3 percentage points. R^2 is actually slightly higher on the test set (0.5732 vs 0.5551 for 80:20, 0.5680 vs 0.5548 for 70:30), so there is little sign of overfitting. The 70:30 split has marginally lower test errors, while the 80:20 split has a marginally higher test R^2.

**Random Forest Regression:**

MAE, RMSE, and MAPE are far higher on the test set than on the training set (for example, MAE of about 43,000 vs about 15,900), and R^2 drops from about 0.93 to about 0.51-0.52, which indicates that the model overfits and does not generalize well to unseen data. The two splits are nearly identical on the test set: 70:30 has marginally lower MAE and MAPE, while 80:20 has lower RMSE and higher R^2.

Both models have higher errors on the test set than on the training set, but the gap is small for Linear Regression and large for Random Forest.
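
One simple way to quantify the train/test gap is to subtract the test R^2 from the training R^2 for each model and split. The sketch below only does arithmetic on the R^2 values copied from the evaluation output above; the `r2` and `gaps` names are introduced here for illustration:

```python
# R^2 values copied from the evaluation output above
r2 = {
    ("Linear Regression", 0.2): {"train": 0.5551, "test": 0.5732},
    ("Linear Regression", 0.3): {"train": 0.5548, "test": 0.5680},
    ("Random Forest Regression", 0.2): {"train": 0.9296, "test": 0.5195},
    ("Random Forest Regression", 0.3): {"train": 0.9286, "test": 0.5078},
}

# A large positive gap means the model fits the training data much better
# than unseen data, which is the usual symptom of overfitting
gaps = {key: scores["train"] - scores["test"] for key, scores in r2.items()}

for (model, test_size), gap in gaps.items():
    print(f"{model:<26} test_size={test_size}: train - test R^2 = {gap:+.4f}")
```

A gap near zero (or slightly negative) suggests the model behaves about the same on seen and unseen data.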

## Benchmarking

<!DOCTYPE html>
<html>
<head>
    <style>
        table {
            width: 100%;
            border-collapse: collapse;
        }
        th, td {
            border: 1px solid #ddd;
            padding: 8px;
        }
        th {
            background-color: #f2f2f2;
            text-align: center;
        }
        .model-header {
            background-color: #e0e0e0;
            text-align: center;
            font-weight: bold;
        }
        .model-subheader {
            background-color: #d0d0d0;
            text-align: center;
            font-weight: bold;
        }
        .metrics-header {
            background-color: #ffffff;
            text-align: center;
        }
        .split-row {
            background-color: #f9f9f9;
        }
        .divider {
            border-bottom: 2px solid #ddd;
        }
    </style>
</head>
<body>

<h2>RandomForestRegressor vs LinearRegression</h2>

<table>
    <thead>
        <tr>
            <th>Split</th>
            <th>Metric</th>
            <th class="model-header" colspan="2">RandomForestRegressor</th>
            <th class="model-header" colspan="2">LinearRegression</th>
        </tr>
        <tr>
            <th></th>
            <th></th>
            <th class="model-subheader">Train</th>
            <th class="model-subheader">Test</th>
            <th class="model-subheader">Train</th>
            <th class="model-subheader">Test</th>
        </tr>
    </thead>
    <tbody>
        <tr class="split-row">
            <td>80:20</td>
            <td>MAE</td>
            <td>15843.0191</td>
            <td>43045.7923</td>
            <td>40356.5020</td>
            <td>40765.4492</td>
        </tr>
        <tr>
            <td></td>
            <td>MSE</td>
            <td>403223965.3826</td>
            <td>2887255915.9704</td>
            <td>2549096747.4459</td>
            <td>2564777435.9967</td>
        </tr>
        <tr>
            <td></td>
            <td>RMSE</td>
            <td>20080.4374</td>
            <td>53733.1919</td>
            <td>50488.5804</td>
            <td>50643.6317</td>
        </tr>
        <tr>
            <td></td>
            <td>R²</td>
            <td>0.9296</td>
            <td>0.5195</td>
            <td>0.5551</td>
            <td>0.5732</td>
        </tr>
        <tr>
            <td></td>
            <td>MAPE</td>
            <td>9.49%</td>
            <td>28.83%</td>
            <td>24.51%</td>
            <td>27.40%</td>
        </tr>
        <tr class="divider">
            <td colspan="6"></td>
        </tr>
        <tr class="split-row">
            <td>70:30</td>
            <td>MAE</td>
            <td>15959.9025</td>
            <td>42995.9320</td>
            <td>40427.5430</td>
            <td>40475.3061</td>
        </tr>
        <tr>
            <td></td>
            <td>MSE</td>
            <td>409183953.9074</td>
            <td>2909144390.0158</td>
            <td>2552133934.3296</td>
            <td>2553086568.9581</td>
        </tr>
        <tr>
            <td></td>
            <td>RMSE</td>
            <td>20228.2959</td>
            <td>53936.4848</td>
            <td>50518.6494</td>
            <td>50528.0770</td>
        </tr>
        <tr>
            <td></td>
            <td>R²</td>
            <td>0.9286</td>
            <td>0.5078</td>
            <td>0.5548</td>
            <td>0.5680</td>
        </tr>
        <tr>
            <td></td>
            <td>MAPE</td>
            <td>9.53%</td>
            <td>28.22%</td>
            <td>24.42%</td>
            <td>26.79%</td>
        </tr>
    </tbody>
</table>

</body>
</html>


**Summary**


Linear Regression shows little overfitting: its test errors are only slightly higher than its training errors, and its test R^2 is slightly higher than its training R^2. The two splits perform almost identically; 70:30 has marginally lower test errors, while 80:20 has a marginally higher test R^2.

Random Forest Regressor has higher test set errors and lower test R^2 than Linear Regression despite much better training metrics, indicating a significant generalization issue. Its two splits are also close, with 80:20 slightly ahead on RMSE and R^2 and 70:30 slightly ahead on MAE and MAPE.

**Overall Comparison:**

Linear Regression offers simpler, more stable predictions, with nearly identical performance on training and test data.

Random Forest provides highly accurate training predictions but struggles with test data generalization, showing significant overfitting and reduced performance on unseen data.

Both models have their strengths and weaknesses. Linear Regression may be preferable for its stability and interpretability, while Random Forest might be reconsidered with additional tuning to improve generalization.
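
The differences between the 80:20 and 70:30 splits are small enough that they could come from which rows happened to land in the test set. A sketch of k-fold cross-validation with `cross_val_score`, which averages over several splits; it uses synthetic data from `make_regression` rather than the housing data, and the smaller forest (`n_estimators=20`) is only there to keep the example fast:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing features and prices
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)

demo_models = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regression": RandomForestRegressor(n_estimators=20, random_state=42),
}

cv_r2 = {}
for name, model in demo_models.items():
    # Five folds: each row is used for testing exactly once
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_r2[name] = scores
    print(f"{name}: mean R^2 = {scores.mean():.4f} (std {scores.std():.4f})")
```

The standard deviation across folds gives a rough sense of how much of a difference between two models or two splits is just noise.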

## Hyperparameter Tuning

### Default Parameters

In [None]:
def get_default_lr_params():
    model = LinearRegression()
    default_params = model.get_params()
    print("\nDefault Hyperparameters for LinearRegression:")
    for param, value in default_params.items():
        print(f"{param}: {value}")
    return default_params

default_lr_params = get_default_lr_params()

def get_default_rf_params():
    model = RandomForestRegressor(random_state=23)
    default_params = model.get_params()
    print("\nDefault Hyperparameters for RandomForestRegressor:")
    for param, value in default_params.items():
        print(f"{param}: {value}")
    return default_params

default_rf_params = get_default_rf_params()


Default Hyperparameters for LinearRegression:
copy_X: True
fit_intercept: True
n_jobs: None
positive: False

Default Hyperparameters for RandomForestRegressor:
bootstrap: True
ccp_alpha: 0.0
criterion: squared_error
max_depth: None
max_features: 1.0
max_leaf_nodes: None
max_samples: None
min_impurity_decrease: 0.0
min_samples_leaf: 1
min_samples_split: 2
min_weight_fraction_leaf: 0.0
n_estimators: 100
n_jobs: None
oob_score: False
random_state: 23
verbose: 0
warm_start: False


### Parameter Grid

In [None]:
from sklearn.model_selection import GridSearchCV

def tune_linear_regression(X_train, y_train):
    # Define the parameter grid for LinearRegression
    param_grid = {
        'fit_intercept': [True, False],
        'copy_X': [True, False],
        'positive': [True, False]
    }

    # Initialize the model
    model = LinearRegression()

    # Create GridSearchCV instance
    grid_search = GridSearchCV(model, param_grid, cv=5, n_jobs=-1, error_score='raise')

    # Fit model to perform search
    grid_search.fit(X_train, y_train)

    # Get best hyperparameters from the search
    best_params = grid_search.best_params_

    # Print the best hyperparameters
    print("\nBest Hyperparameters for LinearRegression:")
    for param, value in best_params.items():
        print(f"{param}: {value}")

    return best_params

# Call the function to tune Linear Regression
best_lr_params = tune_linear_regression(X_train, y_train)


Best Hyperparameters for LinearRegression:
copy_X: True
fit_intercept: True
positive: True


In [None]:
def tune_random_forest(X_train, y_train):
    # Define the parameter grid RandomForestRegressor
    param_grid = {
        'n_estimators': [50, 100, 200],         # Number of trees in the forest
        'max_depth': [None, 10, 20],            # Maximum depth of the tree
        'min_samples_split': [2, 5, 10],        # Minimum number of samples required to split an internal node
    }

    # Initialize the model
    model = RandomForestRegressor(random_state=23)

    # Create GridSearchCV instance
    grid_search = GridSearchCV(model, param_grid, cv=5, n_jobs=-1, error_score='raise')

    # Fit model to perform search
    grid_search.fit(X_train, y_train)

    # Get best hyperparameters from search
    best_params = grid_search.best_params_

    # Print best hyperparameters
    print("\nBest Hyperparameters for RandomForestRegressor:")
    for param, value in best_params.items():
        print(f"{param}: {value}")

    return best_params

# Call the function to tune Random Forest Regressor
best_rf_params = tune_random_forest(X_train, y_train)



Best Hyperparameters for RandomForestRegressor:
max_depth: 10
min_samples_split: 10
n_estimators: 200


Redefine the models with the best parameters, then train, predict, and evaluate.

This block is for my testing only:

In [None]:
from sklearn.metrics import mean_absolute_percentage_error

def evaluate_model(y_true, y_pred):
    """
    Evaluate the performance of a model using various metrics.

    Parameters:
    y_true (array-like): True values.
    y_pred (array-like): Predicted values.

    Returns:
    tuple: MAE, MSE, RMSE, R², MAPE
    """
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    mape = mean_absolute_percentage_error(y_true, y_pred) * 100
    return mae, mse, rmse, r2, mape

def train_predict_evaluate(X, y, test_size, best_lr_params, best_rf_params):
    """
    Train, predict, and evaluate Linear Regression and Random Forest models.

    Parameters:
    X (array-like): Feature data.
    y (array-like): Target data.
    test_size (float): Proportion of data to be used as test set.
    best_lr_params (dict): Best hyperparameters for Linear Regression.
    best_rf_params (dict): Best hyperparameters for Random Forest.
    """
    # SPLIT the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=23)

    # Initialize models with best hyperparameters
    # ** unpacks dictionary into keyword arguments when initializing model:
    # lr_model = LinearRegression(**best_lr_params)
    # then pulls {'fit_intercept': True, 'copy_X': True, 'positive': True} from best_lr_params = tune_linear_regression(X_train, y_train)
    lr_model = LinearRegression(**best_lr_params)
    rf_model = RandomForestRegressor(**best_rf_params, random_state=23)

    print("Training Linear Regression...")
    # TRAIN models
    lr_model.fit(X_train, y_train)
    print("Training Random Forest...")
    rf_model.fit(X_train, y_train)

    print("Predicting with Linear Regression...")
    # PREDICT
    lr_predictions = lr_model.predict(X_test)
    print("Predicting with Random Forest...")
    rf_predictions = rf_model.predict(X_test)

    print("Evaluating Linear Regression...")
    # EVALUATE
    lr_train_preds = lr_model.predict(X_train)
    lr_train_metrics = evaluate_model(y_train, lr_train_preds)
    lr_test_metrics = evaluate_model(y_test, lr_predictions)

    print("Evaluating Random Forest...")
    rf_train_preds = rf_model.predict(X_train)
    rf_train_metrics = evaluate_model(y_train, rf_train_preds)
    rf_test_metrics = evaluate_model(y_test, rf_predictions)

    # Print results
    print(f"\nEvaluation for Linear Regression:\n")
    print(f"Test Size: {test_size}")
    print(f"Train - MAE: {lr_train_metrics[0]:.4f}, MSE: {lr_train_metrics[1]:.4f}, RMSE: {lr_train_metrics[2]:.4f}, R²: {lr_train_metrics[3]:.4f}, MAPE: {lr_train_metrics[4]:.2f}%")
    print(f"Test  - MAE: {lr_test_metrics[0]:.4f}, MSE: {lr_test_metrics[1]:.4f}, RMSE: {lr_test_metrics[2]:.4f}, R²: {lr_test_metrics[3]:.4f}, MAPE: {lr_test_metrics[4]:.2f}%")

    print(f"\nEvaluation for Random Forest Regression:\n")
    print(f"Test Size: {test_size}")
    print(f"Train - MAE: {rf_train_metrics[0]:.4f}, MSE: {rf_train_metrics[1]:.4f}, RMSE: {rf_train_metrics[2]:.4f}, R²: {rf_train_metrics[3]:.4f}, MAPE: {rf_train_metrics[4]:.2f}%")
    print(f"Test  - MAE: {rf_test_metrics[0]:.4f}, MSE: {rf_test_metrics[1]:.4f}, RMSE: {rf_test_metrics[2]:.4f}, R²: {rf_test_metrics[3]:.4f}, MAPE: {rf_test_metrics[4]:.2f}%")

# Call the function for both splits
train_predict_evaluate(X, y, test_size=0.2, best_lr_params=best_lr_params, best_rf_params=best_rf_params)  # 80:20 split
train_predict_evaluate(X, y, test_size=0.3, best_lr_params=best_lr_params, best_rf_params=best_rf_params)  # 70:30 split


Training Linear Regression...
Training Random Forest...
Predicting with Linear Regression...
Predicting with Random Forest...
Evaluating Linear Regression...
Evaluating Random Forest...

Evaluation for Linear Regression:

Test Size: 0.2
Train - MAE: 40332.0570, MSE: 2531183459.4308, RMSE: 50310.8682, R²: 0.5632, MAPE: 2479.34%
Test  - MAE: 40857.8027, MSE: 2636110412.8051, RMSE: 51343.0659, R²: 0.5417, MAPE: 2578.58%

Evaluation for Random Forest Regression:

Test Size: 0.2
Train - MAE: 34213.0806, MSE: 1817997301.7465, RMSE: 42637.9796, R²: 0.6862, MAPE: 2097.63%
Test  - MAE: 41542.4950, MSE: 2725642623.0416, RMSE: 52207.6874, R²: 0.5261, MAPE: 2616.49%
Training Linear Regression...
Training Random Forest...
Predicting with Linear Regression...
Predicting with Random Forest...
Evaluating Linear Regression...
Evaluating Random Forest...

Evaluation for Linear Regression:

Test Size: 0.3
Train - MAE: 40106.5543, MSE: 2503198525.0461, RMSE: 50031.9750, R²: 0.5660, MAPE: 2450.53%
Test  - 

In [None]:
def evaluate_model(y_true, y_pred):
    """
    Evaluate the performance of a model using various metrics.

    Parameters:
    y_true (array-like): True values.
    y_pred (array-like): Predicted values.

    Returns:
    tuple: MAE, MSE, RMSE, R², MAPE
    """
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)

    # Adjust MAPE calculation to handle cases where y_true might be zero
    epsilon = 1e-8  # Small constant to avoid division by zero
    mape = np.mean(np.abs((y_true - y_pred) / np.maximum(np.abs(y_true), epsilon))) * 100

    return mae, mse, rmse, r2, mape

def train_predict_evaluate(X, y, test_size, best_lr_params, best_rf_params):
    """
    Train, predict, and evaluate Linear Regression and Random Forest models.

    Parameters:
    X (array-like): Feature data.
    y (array-like): Target data.
    test_size (float): Proportion of data to be used as test set.
    best_lr_params (dict): Best hyperparameters for Linear Regression.
    best_rf_params (dict): Best hyperparameters for Random Forest.
    """
    # SPLIT the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=23)

    # Initialize models with best hyperparameters
    # ** Unpacks dictionary into keyword arguments when initializing model:
    #    lr_model = LinearRegression(**best_lr_params)
    #    then pulls {'fit_intercept': True, 'copy_X': True, 'positive': True}
    #    from best_lr_params = tune_linear_regression(X_train, y_train)
    lr_model = LinearRegression(**best_lr_params)
    rf_model = RandomForestRegressor(**best_rf_params, random_state=23)

    # TRAIN models
    lr_model.fit(X_train, y_train)
    rf_model.fit(X_train, y_train)

    # PREDICT
    lr_train_preds = lr_model.predict(X_train)
    lr_test_preds = lr_model.predict(X_test)
    rf_train_preds = rf_model.predict(X_train)
    rf_test_preds = rf_model.predict(X_test)

    # EVALUATE
    lr_train_metrics = evaluate_model(y_train, lr_train_preds)
    lr_test_metrics = evaluate_model(y_test, lr_test_preds)
    rf_train_metrics = evaluate_model(y_train, rf_train_preds)
    rf_test_metrics = evaluate_model(y_test, rf_test_preds)

    # Print results
    print(f"\nEvaluation for Linear Regression:\n")
    print(f"Test Size: {test_size}")
    print(f"Train - MAE: {lr_train_metrics[0]:.4f}, MSE: {lr_train_metrics[1]:.4f}, RMSE: {lr_train_metrics[2]:.4f}, R²: {lr_train_metrics[3]:.4f}, MAPE: {lr_train_metrics[4]:.2f}%")
    print(f"Test  - MAE: {lr_test_metrics[0]:.4f}, MSE: {lr_test_metrics[1]:.4f}, RMSE: {lr_test_metrics[2]:.4f}, R²: {lr_test_metrics[3]:.4f}, MAPE: {lr_test_metrics[4]:.2f}%")

    print(f"\nEvaluation for Random Forest Regression:\n")
    print(f"Test Size: {test_size}")
    print(f"Train - MAE: {rf_train_metrics[0]:.4f}, MSE: {rf_train_metrics[1]:.4f}, RMSE: {rf_train_metrics[2]:.4f}, R²: {rf_train_metrics[3]:.4f}, MAPE: {rf_train_metrics[4]:.2f}%")
    print(f"Test  - MAE: {rf_test_metrics[0]:.4f}, MSE: {rf_test_metrics[1]:.4f}, RMSE: {rf_test_metrics[2]:.4f}, R²: {rf_test_metrics[3]:.4f}, MAPE: {rf_test_metrics[4]:.2f}%")

# Call the function for both splits
train_predict_evaluate(X, y, test_size=0.2, best_lr_params=best_lr_params, best_rf_params=best_rf_params)  # 80:20 split
train_predict_evaluate(X, y, test_size=0.3, best_lr_params=best_lr_params, best_rf_params=best_rf_params)  # 70:30 split



Evaluation for Linear Regression:

Test Size: 0.2
Train - MAE: 40332.0570, MSE: 2531183459.4308, RMSE: 50310.8682, R²: 0.5632, MAPE: 24.79%
Test  - MAE: 40857.8027, MSE: 2636110412.8051, RMSE: 51343.0659, R²: 0.5417, MAPE: 25.79%

Evaluation for Random Forest Regression:

Test Size: 0.2
Train - MAE: 34213.0806, MSE: 1817997301.7465, RMSE: 42637.9796, R²: 0.6862, MAPE: 20.98%
Test  - MAE: 41542.4950, MSE: 2725642623.0416, RMSE: 52207.6874, R²: 0.5261, MAPE: 26.16%

Evaluation for Linear Regression:

Test Size: 0.3
Train - MAE: 40106.5543, MSE: 2503198525.0461, RMSE: 50031.9750, R²: 0.5660, MAPE: 24.51%
Test  - MAE: 41222.8582, MSE: 2668651785.3561, RMSE: 51658.9952, R²: 0.5417, MAPE: 26.17%

Evaluation for Random Forest Regression:

Test Size: 0.3
Train - MAE: 33647.0758, MSE: 1758677570.8875, RMSE: 41936.5899, R²: 0.6951, MAPE: 20.49%
Test  - MAE: 41867.1083, MSE: 2749298093.3533, RMSE: 52433.7496, R²: 0.5278, MAPE: 26.46%
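
The `**` unpacking used in `train_predict_evaluate` can be seen in isolation in the sketch below. `make_model` is a hypothetical stand-in for a model constructor, introduced only for this example, so the snippet runs without scikit-learn:

```python
def make_model(fit_intercept=True, copy_X=True, positive=False):
    # Hypothetical stand-in for a constructor such as LinearRegression(...);
    # it just records the keyword arguments it received
    return {"fit_intercept": fit_intercept, "copy_X": copy_X, "positive": positive}

best_params = {"fit_intercept": True, "copy_X": True, "positive": True}

# ** spreads the dictionary into keyword arguments...
unpacked = make_model(**best_params)
# ...so it is equivalent to writing each argument out by hand
explicit = make_model(fit_intercept=True, copy_X=True, positive=True)

print(unpacked == explicit)
```

Keys in the dictionary must match parameter names of the function being called; an unknown key raises a `TypeError`.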


## Benchmark


## Comprehensive Model Performance Comparison

| **Metric** | **LinearReg** <br> 80:20 <br> Before Tuning <br> Train | **LinearReg** <br> 80:20 <br> Before Tuning <br> Test | **LinearReg** <br> 80:20 <br> After Tuning <br> Train | **LinearReg** <br> 80:20 <br> After Tuning <br> Test | **LinearReg** <br> 70:30 <br> Before Tuning <br> Train | **LinearReg** <br> 70:30 <br> Before Tuning <br> Test | **LinearReg** <br> 70:30 <br> After Tuning <br> Train | **LinearReg** <br> 70:30 <br> After Tuning <br> Test | **RandomForReg** <br> 80:20 <br> Before Tuning <br> Train | **RandomForReg** <br> 80:20 <br> Before Tuning <br> Test | **RandomForReg** <br> 80:20 <br> After Tuning <br> Train | **RandomForReg** <br> 80:20 <br> After Tuning <br> Test | **RandomForReg** <br> 70:30 <br> Before Tuning <br> Train | **RandomForReg** <br> 70:30 <br> Before Tuning <br> Test | **RandomForReg** <br> 70:30 <br> After Tuning <br> Train | **RandomForReg** <br> 70:30 <br> After Tuning <br> Test |
|------------|--------------------------------------------------------------|--------------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------|--------------------------------------------------------------|--------------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------|--------------------------------------------------------------|--------------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------|--------------------------------------------------------------|--------------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------|
| **MAE**    | 40,356.50                                                     | 40,765.45                                                     | 40,332.06                                                   | 40,857.80                                                   | 40,427.54                                                     | 40,475.31                                                     | 40,106.55                                                   | 41,222.86                                                   | 15,843.02                                                     | 43,045.79                                                     | 34,213.08                                                   | 41,542.50                                                   | 15,959.90                                                     | 42,995.93                                                     | 33,647.08                                                   | 41,867.11                                                   |
| **MSE**    | 2,549,096,747.45 | 2,564,777,436.00 | 2,531,183,459.43 | 2,636,110,412.81 | 2,552,133,934.33 | 2,553,086,568.96 | 2,503,198,525.05 | 2,668,651,785.36 | 403,223,965.38 | 2,887,255,915.97 | 1,817,997,301.75 | 2,725,642,623.04 | 409,183,953.91 | 2,909,144,390.02 | 1,758,677,570.89 | 2,749,298,093.35 |
| **RMSE**   | 50,488.58 | 50,643.63 | 50,310.87 | 51,343.07 | 50,518.65 | 50,528.08 | 50,031.98 | 51,659.00 | 20,080.44 | 53,733.19 | 42,637.98 | 52,207.69 | 20,228.30 | 53,936.48 | 41,936.59 | 52,433.75 |
| **R²**    | 0.5551                                                       | 0.5732                                                       | 0.5632                                                     | 0.5417                                                     | 0.5548                                                       | 0.5680                                                       | 0.5660                                                     | 0.5417                                                     | 0.9296                                                       | 0.5195                                                       | 0.6862                                                     | 0.5261                                                     | 0.9286                                                       | 0.5078                                                       | 0.6951                                                     | 0.5278                                                     |
| **MAPE**   | 24.51%                                                       | 27.40%                                                       | 24.79%                                                     | 25.79%                                                     | 24.42%                                                       | 26.79%                                                       | 24.51%                                                     | 26.17%                                                     | 9.49%                                                        | 28.83%                                                       | 20.98%                                                     | 26.16%                                                     | 9.53%                                                        | 28.22%                                                       | 20.49%                                                     | 26.46%                                                     |


## **Summary**

*Linear Regression:*
        
The performance metrics for Linear Regression remained mostly unchanged after tuning. The MAE, MSE, RMSE, R^2, and MAPE values are similar before and after tuning, indicating that the tuning process did not significantly affect the Linear Regression model's performance.

*Random Forest Regressor:*
- Before Tuning:
  - Shows very high performance on the training data with a very high R^2 and low MAE, MSE, and RMSE values. However, the performance on the test data was much lower, indicating potential overfitting.
- After Tuning:
  - The performance improved on the test data, with better R^2 and lower MAE values compared to before tuning. However, the model still exhibits a significant difference between training and test metrics, suggesting room for further improvement.

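Overall, tuning had a more substantial impact on the Random Forest Regressor, improving its test performance, while the Linear Regression metrics remained relatively stable. After tuning, the Random Forest Regressor performs almost the same at both splits: the 80:20 split has slightly lower test errors (MAE 41,542.50 vs 41,867.11; MAPE 26.16% vs 26.46%), while the 70:30 split has a minimally higher R^2 (0.5278 vs 0.5261). Note that the before-tuning runs split the data with `random_state=42` and the after-tuning runs with `random_state=23`, so part of the before/after difference may come from the different splits rather than from the tuning itself.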
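
To isolate the effect of tuning, the default and tuned models can be evaluated on exactly the same train/test split. The sketch below assumes synthetic data from `make_regression` instead of the housing data, and reuses `max_depth=10` and `min_samples_split=10` from the grid search output purely as example values; `n_estimators=50` is chosen only to keep it fast:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features and prices
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)

# One split, shared by both models, so any difference comes from the hyperparameters
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=23)

candidates = {
    "default": RandomForestRegressor(n_estimators=50, random_state=23),
    "tuned": RandomForestRegressor(
        n_estimators=50, max_depth=10, min_samples_split=10, random_state=23
    ),
}

test_mae = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    test_mae[name] = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{name}: test MAE = {test_mae[name]:.2f}")
```

For a stricter comparison, the grid search would also be fitted only on the training portion of this same split, so that no test rows influence the chosen hyperparameters.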