# Random Forest Model Training



## Jupyter Notebook Setup

I will begin training the Random Forest model again. First, I will combine the numerical and categorical data, then prepare to train the model.

In [2]:
%pip install ipykernel


Note: you may need to restart the kernel to use updated packages.


In [60]:
# Loading necessary packages
%pip install scikit-learn numpy pandas
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, mean_absolute_percentage_error

# Loading math for square root operator
import math



In [3]:
# Importing numerical and categorical datasets
numerical_data = pd.read_csv('numerical_dataframe_hannah_nguyen.csv')
categorical_data = pd.read_csv('categorical_dataframe_hannah_nguyen_fixed.csv')

combined_data = pd.concat([numerical_data, categorical_data], axis = 1)
combined_data



Unnamed: 0,OriginalListPrice,ClosePrice,Latitude,Longitude,ListPrice,DaysOnMarket,YearBuilt,BedroomsTotal,Stories,LotSizeArea,...,Flooring_Brick,Flooring_Carpet,Flooring_Concrete,Flooring_Laminate,Flooring_Mixed,Flooring_SeeRemarks,Flooring_Stone,Flooring_Tile,Flooring_Vinyl,Flooring_Wood
0,759900.0,815000.0,32.659315,-117.096922,759900.0,33,2023.0,4.0,2.0,7035.0,...,0,0,0,0,0,0,0,0,0,1
1,4200.0,4500.0,32.614169,-116.986367,4500.0,33,2013.0,5.0,2.0,7035.0,...,0,0,0,0,1,0,0,0,0,0
2,5500.0,1200.0,33.605136,-117.897715,30575.0,1004,1984.0,3.0,1.0,2590.0,...,0,0,0,0,1,0,0,0,0,0
3,739900.0,810000.0,32.659284,-117.097174,770000.0,228,2023.0,4.0,2.0,7035.0,...,0,0,0,0,0,0,0,0,0,1
4,44000.0,17500.0,34.077850,-118.024179,36000.0,1089,2015.0,4.0,2.0,7035.0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
504752,925000.0,925000.0,33.775773,-117.508224,925000.0,54,2003.0,5.0,2.0,7405.0,...,0,0,0,0,1,0,0,0,0,0
504753,599000.0,525000.0,34.227234,-116.429729,589000.0,79,1957.0,2.0,1.0,217800.0,...,0,0,0,0,0,0,0,1,0,0
504754,275000.0,313000.0,33.735287,-116.906467,310000.0,56,1964.0,2.0,1.0,10454.0,...,0,0,0,0,0,0,0,0,0,1
504755,825000.0,835000.0,34.000171,-117.500449,825000.0,16,2003.0,5.0,1.0,20038.0,...,0,0,0,0,1,0,0,0,0,0
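
Because `pd.concat(..., axis=1)` aligns rows by index label rather than by position, a quick check before combining can catch mismatched files. The sketch below is a minimal illustration using two small made-up frames in place of the real CSVs; the helper name `concat_aligned` is hypothetical and not part of the notebook.

```python
import pandas as pd


def concat_aligned(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Concatenate column-wise after checking that both frames share the same index."""
    if len(left) != len(right):
        raise ValueError(f"Row counts differ: {len(left)} vs {len(right)}")
    if not left.index.equals(right.index):
        raise ValueError("Indexes differ; concat would introduce NaN rows")
    return pd.concat([left, right], axis=1)


# Made-up stand-ins for the numerical and categorical frames
numerical_demo = pd.DataFrame({"ClosePrice": [815000.0, 4500.0]})
categorical_demo = pd.DataFrame({"Flooring_Wood": [1, 0]})

combined_demo = concat_aligned(numerical_demo, categorical_demo)
print(combined_demo.shape)
```

If the check fails, resetting the index on both frames (`reset_index(drop=True)`) before concatenating is one option, provided the rows are known to be in the same order.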


In [68]:
# Exporting combined_data as a csv file
combined_data.to_csv('combined_data_cleaned_hannah_nguyen.csv', index = False)

## Model Training

### Splitting columns into features and targets

In [4]:
# Checking column names
combined_data.columns

Index(['OriginalListPrice', 'ClosePrice', 'Latitude', 'Longitude', 'ListPrice',
       'DaysOnMarket', 'YearBuilt', 'BedroomsTotal', 'Stories', 'LotSizeArea',
       'GarageSpaces', 'AssociationFee', 'PoolPrivateYN', 'FireplaceYN',
       'NewConstructionYN', 'City_freq', 'Level_One', 'Level_Two',
       'Level_ThreeOrMore', 'Level_MultiSplit', 'PropertyType_Encoded',
       'Flooring_Brick', 'Flooring_Carpet', 'Flooring_Concrete',
       'Flooring_Laminate', 'Flooring_Mixed', 'Flooring_SeeRemarks',
       'Flooring_Stone', 'Flooring_Tile', 'Flooring_Vinyl', 'Flooring_Wood'],
      dtype='object')

In [5]:
# Dropping the list-price columns and DaysOnMarket, which only apply to homes listed for sale

combined_data_filtered = combined_data.drop(columns=["ListPrice", "OriginalListPrice", "DaysOnMarket"])

In [6]:
# Calculate the correlation matrix
corr_matrix = combined_data_filtered.corr(numeric_only=True)

# Display correlations with the target variable, ClosePrice
target_corr = corr_matrix["ClosePrice"].sort_values(ascending=False)
print(target_corr)

ClosePrice              1.000000
BedroomsTotal           0.252931
FireplaceYN             0.137032
PoolPrivateYN           0.114722
Flooring_Mixed          0.085596
Stories                 0.085251
Level_ThreeOrMore       0.084387
Latitude                0.075368
Level_Two               0.071298
Level_MultiSplit        0.049206
AssociationFee          0.046234
GarageSpaces            0.025815
LotSizeArea             0.018549
City_freq               0.011943
Flooring_Wood           0.011745
Flooring_Stone          0.010329
YearBuilt               0.005743
NewConstructionYN      -0.000424
Flooring_Brick         -0.000676
Flooring_Concrete      -0.008571
Flooring_SeeRemarks    -0.017963
Flooring_Carpet        -0.025863
Flooring_Vinyl         -0.031347
Flooring_Tile          -0.035671
Flooring_Laminate      -0.060146
Level_One              -0.105312
Longitude              -0.110891
PropertyType_Encoded   -0.292684
Name: ClosePrice, dtype: float64


After dropping the list-price columns, none of the remaining predictors is strongly correlated with the dependent variable (the largest magnitude is about 0.29, for PropertyType_Encoded). I will now split the columns into features and target.
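
`DataFrame.corr` defaults to Pearson correlation, which only measures linear association, so a low value does not rule out a predictor that is useful in a non-linear way (something tree-based models can still exploit). The sketch below uses synthetic data, not the housing dataset, to compare Pearson with Spearman rank correlation.

```python
import numpy as np
import pandas as pd

# Synthetic example: y grows exponentially with x (monotonic but non-linear)
x = np.linspace(0, 10, 50)
demo = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = demo.corr(method="pearson").loc["x", "y"]
spearman = demo.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Running the same comparison on `combined_data_filtered` would show whether any feature has a monotonic relationship with ClosePrice that the Pearson table understates.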

In [7]:
# Splitting columns into features or targets
X = combined_data_filtered.drop(columns = ['ClosePrice'])
Y = combined_data_filtered['ClosePrice']

### Splitting into Training and Testing

As of August 12, 2025, I changed the split from 80/20 (training/testing) to 64/16/20 (new_train/val/testing). Because the first three models had already seen the testing data due to an earlier error, I keep the original testing set and split only the training data into two parts, new_train and val.

new_train is used to train the model, whereas val is used during evaluation to tune the hyperparameters. The testing data will then be used once, at the end, when the model is finalized.

In [8]:
# 80% Training, 20% Testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)

# 80% new_train, 20% val (64% new_train, 16% val, 20% testing)
X_new_train, X_val, Y_new_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=42)


# Checking number of rows
print(f"Training set size: {X_train.shape[0]} rows")
print(f"Testing set size: {X_test.shape[0]} rows")

Training set size: 403805 rows
Testing set size: 100952 rows
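
The cell above only prints the sizes of the first split. As a rough cross-check of the 64/16/20 proportions, the sketch below recomputes the three row counts with plain arithmetic. It assumes the held-out share is rounded up at each step, which agrees with the row counts printed above; the function name `split_sizes` is hypothetical.

```python
from math import ceil


def split_sizes(n_rows: int, test_size: float = 0.2, val_size: float = 0.2):
    """Return (new_train, val, test) row counts for two successive splits.

    Assumes the held-out share is rounded up at each step.
    """
    n_test = ceil(test_size * n_rows)
    n_train = n_rows - n_test
    n_val = ceil(val_size * n_train)
    n_new_train = n_train - n_val
    return n_new_train, n_val, n_test


n_total = 403805 + 100952  # rows reported by the cell above
new_train, val, test = split_sizes(n_total)
for name, n in [("new_train", new_train), ("val", val), ("test", test)]:
    print(f"{name}: {n} rows ({n / n_total:.1%})")
```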


### Goal MAE and MSE

In [9]:
# The goal is an MAE of 10-20% of the mean ClosePrice
# RMSE (the square root of the MSE) is easier to interpret than MSE, so it has the same 10-20% goal

mean_y_new_train = Y_new_train.mean()
mean_y_val = Y_val.mean()
mean_y_test = Y_test.mean()

print(f"Y_new_train's mean is: {mean_y_new_train}\n")
print(f"Y_val's mean is: {mean_y_val}\n")
print(f"Y_test's mean is: {mean_y_test}\n")

goal_new_train = [mean_y_new_train * 0.1, mean_y_new_train * 0.2]
goal_val = [mean_y_val * 0.1, mean_y_val * 0.2]
goal_test = [mean_y_test*0.1, mean_y_test*0.2]

print(f"The goal MAE/RMSE for TRAINED DATA is between {goal_new_train[0]} and {goal_new_train[1]}\n")
print(f"The goal MAE/RMSE for VALIDATION DATA is between {goal_val[0]} and {goal_val[1]}\n")
print(f"The goal MAE/RMSE for TEST DATA is between {goal_test[0]} and {goal_test[1]}\n")

Y_new_train's mean is: 1013972.8653883374

Y_val's mean is: 1010753.2461734004

Y_test's mean is: 1014087.4000750853

The goal MAE/RMSE for TRAINED DATA is between 101397.28653883375 and 202794.5730776675

The goal MAE/RMSE for VALIDATION DATA is between 101075.32461734005 and 202150.6492346801

The goal MAE/RMSE for TEST DATA is between 101408.74000750855 and 202817.4800150171
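
To compare a metric against these goals without reading raw dollar amounts, the error can be expressed as a fraction of the mean target. The helper below is a hypothetical sketch (the name `goal_status` and the three labels are assumptions), applied to the validation mean printed above and to the default model's validation MAE and RMSE reported later in this notebook.

```python
def goal_status(error: float, mean_target: float, low: float = 0.10, high: float = 0.20):
    """Return (error / mean, label) where the label places the ratio relative to the goal range."""
    ratio = error / mean_target
    if ratio < low:
        status = "below the goal range"
    elif ratio <= high:
        status = "within the goal range"
    else:
        status = "above the goal range"
    return ratio, status


mean_val_demo = 1010753.2461734004  # Y_val mean printed above

mae_ratio, mae_status = goal_status(175669.06917163962, mean_val_demo)
rmse_ratio, rmse_status = goal_status(773639.6022775444, mean_val_demo)

print(f"MAE:  {mae_ratio:.1%} of the mean, {mae_status}")
print(f"RMSE: {rmse_ratio:.1%} of the mean, {rmse_status}")
```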



### Training Random Forest Model

In [10]:
# creating a function to easily evaluate the model for future uses

def evaluate_model(Y_pred_val0, Y_pred_train0, num):
    # Calculating different metrics

    print(f"Training Model {num} Tests\n")
    print(f"The goal MAE and RMSE for TRAINING DATA is between {goal_new_train[0]} and {goal_new_train[1]}")
    print(f"The goal MAE and RMSE for VALIDATION DATA is between {goal_val[0]} and {goal_val[1]}\n")

    # Calculate Mean Absolute Error (MAE)
    mae_val = mean_absolute_error(Y_val, Y_pred_val0)
    mae_train = mean_absolute_error(Y_new_train, Y_pred_train0)
    print(f"Mean Absolute Error (VALIDATION DATA): {mae_val}")
    print(f"Mean Absolute Error (TRAINING DATA): {mae_train}\n")


    # Calculate Root Mean Squared Error (RMSE)
    mse_val = mean_squared_error(Y_val, Y_pred_val0)
    mse_train = mean_squared_error(Y_new_train, Y_pred_train0)
    print(f"Mean Squared Error (VALIDATION DATA): {mse_val}")
    print(f"Mean Squared Error (TRAINING DATA): {mse_train}\n")

    rmse_val = math.sqrt(mse_val)
    rmse_train = math.sqrt(mse_train)
    print(f"Root Mean Squared Error (VALIDATION DATA): {rmse_val}")
    print(f"Root Mean Squared Error (TRAINING DATA): {rmse_train}\n")

    # Calculate R-squared (goodness of fit)
    r2_val = r2_score(Y_val, Y_pred_val0)
    r2_train = r2_score(Y_new_train, Y_pred_train0)
    print(f"R-squared (VAL DATA): {r2_val}")
    print(f"R-squared (TRAINING DATA): {r2_train}")
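
`evaluate_model` reads `Y_val`, `Y_new_train`, and the goal ranges from global variables and only prints its results. A possible variant, sketched below under the assumption that returning the values is useful for comparing models later, takes the true values explicitly and returns the metrics as a dictionary. It uses NumPy directly to show the formulas; the name `regression_metrics` is hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute MAE, MSE, RMSE, and R-squared for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean

    return {
        "mae": float(mae),
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),
        "r2": float(1 - ss_res / ss_tot),
    }


demo_metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(demo_metrics)
```

In the notebook this could be called once per data split, for example `regression_metrics(Y_val, Y_pred_val_default)`, and the resulting dictionaries collected into a DataFrame to compare models side by side.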

### Default Model

In [11]:
# Initializing a model based on default parameters
rf_model_default = RandomForestRegressor(random_state = 42)


In [11]:
# Fit model using training data
rf_model_default.fit(X_new_train, Y_new_train)

In [12]:
# Making predictions on both the training and validation data
Y_pred_val_default = rf_model_default.predict(X_val)
Y_pred_train_default = rf_model_default.predict(X_new_train)

In [13]:
# Calculating different metrics
evaluate_model(Y_pred_val_default, Y_pred_train_default, 0)

Training Model 0 Tests

The goal MAE and RMSE for TRAINING DATA is between 101397.28653883375 and 202794.5730776675
The goal MAE and RMSE for VALIDATION DATA is between 101075.32461734005 and 202150.6492346801

Mean Absolute Error (VALIDATION DATA): 175669.06917163962
Mean Absolute Error (TRAINING DATA): 64975.514557016584

Mean Squared Error (VALIDATION DATA): 598518234212.1572
Mean Squared Error (TRAINING DATA): 128562932071.23537

Root Mean Squared Error (VALIDATION DATA): 773639.6022775444
Root Mean Squared Error (TRAINING DATA): 358556.73480111256

R-squared (VAL DATA): 0.6241970234687513
R-squared (TRAINING DATA): 0.9446250479945772


Since the R^2 is higher and the MAE and RMSE are lower on the training data than on the validation data, the results suggest the model is overfitting. I will adjust the parameters to fix this issue using RandomizedSearchCV.

Earlier, I made an error where I compared the model's results using the training and testing data, which indirectly led to the testing data being used as data to train the model. This affected the next 3 test models, which I marked as "archived" to preserve the work; they will not be used in the final evaluation.

-- Archived --

Based on the results, the model seems to be overfitting: it has memorized the patterns of the training data and is unable to generalize. I will try to train the model again using the best model from the previous training session.

### Archived Models (1 - 4)

#### Training Model 1 (Archived)


In [31]:
# Initializing model
rf_model1 = RandomForestRegressor(
    n_estimators=50,  # Set number of trees to 50
    max_depth=15,      # Limit tree depth to 15
    min_samples_split=10,  # Increase min samples for split
    min_samples_leaf=5,    # Increase min samples for leaf node
    max_features='sqrt',   # Use square root of features for splitting
    random_state=42,       # Ensure reproducibility
    n_jobs=-1              # Use all available cores
)

In [32]:
# Fit model using training data
rf_model1.fit(X_train, Y_train)

In [34]:
# Making prediction based on model
Y_pred_test1 = rf_model1.predict(X_test)
Y_pred_train1 = rf_model1.predict(X_train)

In [35]:
# Evaluating the model
evaluate_model(Y_pred_test1, Y_pred_train1, 1)

Training Model 1 Tests

The goal MAE and RMSE for TRAINING DATA is between 101332.894154535 and 202665.78830907
The goal MAE and RMSE for TESTING DATA is between 101408.74000750855 and 202817.4800150171

Mean Absolute Error (TEST DATA): 64071.070609973984
Mean Absolute Error (TRAINING DATA): 58101.572363619096

Mean Squared Error (TEST DATA): 358211783712.8122
Mean Squared Error (TRAINING DATA): 396407243384.6567

Root Mean Squared Error (TEST DATA): 598507.9646193626
Root Mean Squared Error (TRAINING DATA): 629608.8018640279

R-squared (TEST DATA): 0.8203245625102882
R-squared (TRAINING DATA): 0.817816964579743


The RMSE and MAE of the training and testing data are now closer together, but they are still high compared to the default model. Since outliers appear to be inflating the RMSE, I will log-transform the target to reduce its skewness and retrain the model to see whether the error decreases.

In [36]:
# Log transform target
Y_train_log = np.log1p(Y_train)
Y_test_log = np.log1p(Y_test)

In [37]:
# Training model on log target
rf_model1.fit(X_train, Y_train_log)

In [38]:
# Predict, then inverse transform
Y_pred_test1_log_ = rf_model1.predict(X_test)
Y_pred_train1_log_ = rf_model1.predict(X_train)

# Reverse transformation
Y_pred_test1_log = np.expm1(Y_pred_test1_log_)
Y_pred_train1_log = np.expm1(Y_pred_train1_log_)

In [39]:
evaluate_model(Y_pred_test1_log, Y_pred_train1_log, 1.5)

Training Model 1.5 Tests

The goal MAE and RMSE for TRAINING DATA is between 101332.894154535 and 202665.78830907
The goal MAE and RMSE for TESTING DATA is between 101408.74000750855 and 202817.4800150171

Mean Absolute Error (TEST DATA): 76065.61557144616
Mean Absolute Error (TRAINING DATA): 69868.92399453827

Mean Squared Error (TEST DATA): 520411456598.93225
Mean Squared Error (TRAINING DATA): 672987568276.4142

Root Mean Squared Error (TEST DATA): 721395.492499733
Root Mean Squared Error (TRAINING DATA): 820358.1951052932

R-squared (TEST DATA): 0.7389668336147286
R-squared (TRAINING DATA): 0.690704647720773


The log transformation seemed to make the model perform worse. I think I will start from the default model and adjust the hyperparameters from there instead.
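
One possible reason a log-transformed target can score worse on MAE and RMSE in the original units: a forest trained on `log1p(price)` averages log-prices, and back-transforming an average of logs gives a value at or below the arithmetic mean of the prices. This is a general property, not a confirmed diagnosis of the model above. The sketch uses made-up prices to illustrate it.

```python
import numpy as np

# Made-up prices with one expensive outlier
prices = np.array([300_000.0, 450_000.0, 600_000.0, 5_000_000.0])

# log1p and expm1 are exact inverses of each other (up to floating-point error)
round_trip = np.expm1(np.log1p(prices))
print(np.allclose(round_trip, prices))

# Averaging in log space and transforming back underestimates the arithmetic mean
mean_price = prices.mean()
back_transformed = np.expm1(np.log1p(prices).mean())
print(f"Arithmetic mean:               {mean_price:,.0f}")
print(f"Back-transformed mean of logs: {back_transformed:,.0f}")
```

Because expensive homes contribute most to the squared error, systematic under-prediction at the high end would raise the RMSE even if typical homes are predicted well.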

#### Training Model 2 (Archived)

In [40]:
rf_model2 = RandomForestRegressor(
    n_estimators = 100, # increase from 50
    max_depth = None, # to capture more complex patterns
    min_samples_split = 2, # decrease from 10
    min_samples_leaf = 1, # decrease from 5
    max_features = 'sqrt', # introduces randomness and can have better generalization
    random_state = 42, # for reproducibility
    n_jobs = -1
)

In [41]:
# Fit model using training data
rf_model2.fit(X_train, Y_train)

In [42]:
# Making prediction based on model
Y_pred_test2 = rf_model2.predict(X_test)
Y_pred_train2 = rf_model2.predict(X_train)

In [43]:
# Calculating different metrics
evaluate_model(Y_pred_test2, Y_pred_train2, 2)

Training Model 2 Tests

The goal MAE and RMSE for TRAINING DATA is between 101332.894154535 and 202665.78830907
The goal MAE and RMSE for TESTING DATA is between 101408.74000750855 and 202817.4800150171

Mean Absolute Error (TEST DATA): 55383.420098938186
Mean Absolute Error (TRAINING DATA): 21004.998886732756

Mean Squared Error (TEST DATA): 360789623664.60803
Mean Squared Error (TRAINING DATA): 64476573057.16535

Root Mean Squared Error (TEST DATA): 600657.6592907212
Root Mean Squared Error (TRAINING DATA): 253922.37604662837

R-squared (TEST DATA): 0.8190315438487672
R-squared (TRAINING DATA): 0.9703674996128867


The MAE and RMSE are still much larger on the testing data than on the training data, which implies the model is overfitting and does not generalize well. I will adjust the hyperparameters to reduce the overfitting.

#### Training Model 3 (Archived)

In [45]:
# Initializing Model
rf_model3 = RandomForestRegressor(
    n_estimators = 100, # unchanged from Model 2
    max_depth = 20, # Decrease to reduce overfitting
    min_samples_split = 3, # increase from 2
    min_samples_leaf = 3, # increase from 1
    max_features = 'sqrt', # introduces randomness and can have better generalization
    random_state = 42, # for reproducibility
    n_jobs = -1
)

In [46]:
# Fit model using training data
rf_model3.fit(X_train, Y_train)

In [47]:
# Making prediction based on model
Y_pred_test3 = rf_model3.predict(X_test)
Y_pred_train3 = rf_model3.predict(X_train)

In [48]:
# Calculating different metrics
evaluate_model(Y_pred_test3, Y_pred_train3, 3)

Training Model 3 Tests

The goal MAE and RMSE for TRAINING DATA is between 101332.894154535 and 202665.78830907
The goal MAE and RMSE for TESTING DATA is between 101408.74000750855 and 202817.4800150171

Mean Absolute Error (TEST DATA): 56209.74696120852
Mean Absolute Error (TRAINING DATA): 43842.52222513759

Mean Squared Error (TEST DATA): 349572723387.88043
Mean Squared Error (TRAINING DATA): 313871446230.4825

Root Mean Squared Error (TEST DATA): 591246.7533846427
Root Mean Squared Error (TRAINING DATA): 560242.3102823299

R-squared (TEST DATA): 0.8246578285109025
R-squared (TRAINING DATA): 0.8557492231529981


#### Training Model 4 (Archived)

In [81]:
# Initializing model
rf_model4 = RandomForestRegressor(
    n_estimators=200,  # Increase number of trees from 100 to 200
    max_depth=15,      # Limit tree depth to 15
    max_features=None,   # Changing from 'sqrt' to 'None' to see if more powerful splits reduces error
    random_state=42,       # Ensure reproducibility
    n_jobs=-1              # Use all available cores
)

In [82]:
# Fit model using training data
rf_model4.fit(X_new_train, Y_new_train)

In [83]:
# Making predictions on both the training and validation data
Y_pred_val_4 = rf_model4.predict(X_val)
Y_pred_train_4 = rf_model4.predict(X_new_train)

In [84]:
# Calculating different metrics
evaluate_model(Y_pred_val_4, Y_pred_train_4, 4)

Training Model 4 Tests

The goal MAE and RMSE for TRAINING DATA is between 101397.28653883375 and 202794.5730776675
The goal MAE and RMSE for VALIDATION DATA is between 101075.32461734005 and 202150.6492346801

Mean Absolute Error (VALIDATION DATA): 67993.49282547204
Mean Absolute Error (TRAINING DATA): 49604.66512397031

Mean Squared Error (VALIDATION DATA): 327212624143.7148
Mean Squared Error (TRAINING DATA): 107575685792.61897

Root Mean Squared Error (VALIDATION DATA): 572025.0205574182
Root Mean Squared Error (TRAINING DATA): 327987.32565850613

R-squared (VAL DATA): 0.7945468139768982
R-squared (TRAINING DATA): 0.9536647279138283


The model performed worse than the default model since it was overfitting. I will experiment with max_depth, min_samples_split, and min_samples_leaf to avoid overfitting using GridSearchCV.

### Training Model 5 (GridSearchCV)

Since GridSearchCV takes a while, I will take a subset of the training and validation data to run this smoothly. I will take a subset of approximately 20% of the original training data, which is around 80k rows.

In [11]:
# Take 20% subset of X_train and Y_train

X_train_subset, _, Y_train_subset, _ = train_test_split(
    X_train, Y_train, 
    test_size=0.8,   # keep 20%, drop 80%
    random_state=42, # reproducible
)

# 80% new_train_subset, 20% val_subset (split of the 20% subset)
X_new_train_subset, X_val_subset, Y_new_train_subset, Y_val_subset = train_test_split(
    X_train_subset, Y_train_subset, test_size=0.2, random_state=42)

# Checking number of rows
print(f"New Training set size: {X_new_train_subset.shape[0]} rows")
print(f"Validation set size: {X_val_subset.shape[0]} rows")

New Training set size: 64608 rows
Validation set size: 16153 rows


In [22]:
# Initialize list of hyperparameters to try
param_list = {
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'n_estimators': [100],
    'max_features': ['sqrt', 'log2']
}
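
The cost of a grid search grows with the product of the option counts, so it helps to count the candidates before fitting. The sketch below repeats the grid from the cell above under a hypothetical name (`param_grid_demo`) and multiplies the list lengths; with 3-fold cross-validation this gives 54 candidates and 162 fits.

```python
from math import prod

# Same values as param_list above, repeated so the block is self-contained
param_grid_demo = {
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "n_estimators": [100],
    "max_features": ["sqrt", "log2"],
}

cv_folds = 3
n_candidates = prod(len(values) for values in param_grid_demo.values())
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```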

In [26]:
# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator = rf_model_default,
    param_grid = param_list,
    cv = 3,
    n_jobs = -1, # use all cores
    scoring = 'neg_mean_absolute_error',
    verbose = 2,
)

In [27]:
# Fit model using training data
grid_search.fit(X_new_train_subset, Y_new_train_subset)

Fitting 3 folds for each of 54 candidates, totalling 162 fits
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   4.0s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   4.1s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   4.1s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   4.1s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   4.2s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   4.2s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   4.2s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=10, n_estimators=

In [28]:
# Best parameters found
print("Best parameters: ", grid_search.best_params_)

Best parameters:  {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}


In [29]:
# Retrain model with the best parameters
rf_model5 = RandomForestRegressor(
    max_depth= None, 
    max_features= 'sqrt', 
    min_samples_leaf= 1, 
    min_samples_split= 2, 
    n_estimators= 100,
    random_state= 42
)

# Training model on original training dataset
rf_model5.fit(X_new_train, Y_new_train)

In [30]:
# Making predictions on both the training and validation data
Y_pred_val_cv1 = rf_model5.predict(X_val)
Y_pred_train_cv1 = rf_model5.predict(X_new_train)

In [31]:
# Calculating different metrics
evaluate_model(Y_pred_val_cv1, Y_pred_train_cv1, 5)

Training Model 5 Tests

The goal MAE and RMSE for TRAINING DATA is between 101397.28653883375 and 202794.5730776675
The goal MAE and RMSE for VALIDATION DATA is between 101075.32461734005 and 202150.6492346801

Mean Absolute Error (VALIDATION DATA): 196681.82540658096
Mean Absolute Error (TRAINING DATA): 75096.74765458908

Mean Squared Error (VALIDATION DATA): 412648770428.05804
Mean Squared Error (TRAINING DATA): 125706027121.6214

Root Mean Squared Error (VALIDATION DATA): 642377.4361137368
Root Mean Squared Error (TRAINING DATA): 354550.4577935578

R-squared (VAL DATA): 0.7409024030939471
R-squared (TRAINING DATA): 0.945855581336655


The model still appears to be overfitting: the training metrics remain much better than the validation metrics (R^2 of about 0.95 vs. 0.74). The validation RMSE is lower than the default model's, but still high. I will try modifying some of the parameters to see if this can be fixed.

### Training Model 6

In [32]:
# Initializing model
rf_model6 = RandomForestRegressor(
    n_estimators=200,  # Increasing from 100 to 200
    max_depth=30,      # Capping depth at 30 (was None)
    min_samples_leaf=2, # Increasing from 1 to 2
    min_samples_split=5, # Increase number from 2 to 5
    max_features=0.5,   # Changing from 'sqrt' to 0.5
    random_state=42,       # Ensure reproducibility
    n_jobs=-1              # Use all available cores
)

In [33]:
# Fit model using training data
rf_model6.fit(X_new_train, Y_new_train)

In [34]:
# Making predictions on both the training and validation data
Y_pred_val_6 = rf_model6.predict(X_val)
Y_pred_train_6 = rf_model6.predict(X_new_train)

In [35]:
# Calculating different metrics
evaluate_model(Y_pred_val_6, Y_pred_train_6, 6)

Training Model 6 Tests

The goal MAE and RMSE for TRAINING DATA is between 101397.28653883375 and 202794.5730776675
The goal MAE and RMSE for VALIDATION DATA is between 101075.32461734005 and 202150.6492346801

Mean Absolute Error (VALIDATION DATA): 170634.75640035016
Mean Absolute Error (TRAINING DATA): 96946.62311321456

Mean Squared Error (VALIDATION DATA): 358542041783.15106
Mean Squared Error (TRAINING DATA): 427818173992.2377

Root Mean Squared Error (VALIDATION DATA): 598783.8022050622
Root Mean Squared Error (TRAINING DATA): 654078.110008459

R-squared (VAL DATA): 0.7748754193077136
R-squared (TRAINING DATA): 0.8157290715900822


The overall metrics are worse than Model 5's. I think I will go back to Model 5 and attempt to perform log transformation on the dependent variable to reduce the effects of the extreme outliers.

### Training Model 7

In [36]:
# Initializing a copy of Model 5
rf_model7 = RandomForestRegressor(
    max_depth= None, 
    max_features= 'sqrt', 
    min_samples_leaf= 1, 
    min_samples_split= 2, 
    n_estimators= 100,
    random_state= 42
)

In [37]:
# Log transform price variable
Y_new_train_log = np.log1p(Y_new_train)

# Train the model
rf_model7.fit(X_new_train, Y_new_train_log)

In [38]:
# Making predictions on both the training and validation data
Y_pred_val_7_log = rf_model7.predict(X_val)
Y_pred_train_7_log = rf_model7.predict(X_new_train)

# Revert the log transformation
Y_pred_val_7 = np.expm1(Y_pred_val_7_log)
Y_pred_train_7 = np.expm1(Y_pred_train_7_log)

In [43]:
# Calculating different metrics
evaluate_model(Y_pred_val_7, Y_pred_train_7, 7)

Training Model 7 Tests

The goal MAE and RMSE for TRAINING DATA is between 101397.28653883375 and 202794.5730776675
The goal MAE and RMSE for VALIDATION DATA is between 101075.32461734005 and 202150.6492346801

Mean Absolute Error (VALIDATION DATA): 190275.3289796595
Mean Absolute Error (TRAINING DATA): 81933.15539268144

Mean Squared Error (VALIDATION DATA): 503640999579.1363
Mean Squared Error (TRAINING DATA): 575099838289.2334

Root Mean Squared Error (VALIDATION DATA): 709676.6866532508
Root Mean Squared Error (TRAINING DATA): 758353.3729662139

R-squared (VAL DATA): 0.6837693892582024
R-squared (TRAINING DATA): 0.752291539788879


The model did worse than Model 5 overall. I will try RandomizedSearchCV to get a better idea of which parameters to use.

### Training Model 8 (RandomizedSearchCV)

In [44]:
# Initialize list of hyperparameters to try
param_list2 = {
    'max_depth': [10, 20, 30, 40],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 4, 10],
    'n_estimators': [100, 200],
    'max_features': ['sqrt', 'log2', 0.5]  
}

In [45]:
# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=rf_model_default,
    param_distributions=param_list2,
    n_iter=10,            # Number of random combos to try
    cv=3,                 
    scoring='neg_mean_absolute_error',
    n_jobs=-1,            # Use all CPU cores
    verbose=2,
    random_state=42       # For reproducibility
)

In [46]:
# Fit to subset of training data
random_search.fit(X_new_train_subset, Y_new_train_subset)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END max_depth=30, max_features=log2, min_samples_leaf=10, min_samples_split=5, n_estimators=100; total time=   5.3s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=10, min_samples_split=5, n_estimators=100; total time=   5.3s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=10, min_samples_split=5, n_estimators=100; total time=   5.4s
[CV] END max_depth=40, max_features=log2, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time=   6.3s
[CV] END max_depth=40, max_features=log2, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time=   6.5s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=10, min_samples_split=2, n_estimators=100; total time=   3.7s
[CV] END max_depth=30, max_features=0.5, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time=  17.0s
[CV] END max_depth=30, max_features=0.5, min_samples_leaf=2, min_samples_split=10, n_estimato

In [47]:
# Best params found
print("Best parameters:", random_search.best_params_)

Best parameters: {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 0.5, 'max_depth': 40}
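
`best_params_` only reports the winner. The fitted search object also stores every sampled candidate in `cv_results_`, which shows how close the runners-up were. The sketch below runs a deliberately tiny search on synthetic data (the dataset, parameter values, and variable names are all illustrative assumptions, not the notebook's settings). Since the scoring is `neg_mean_absolute_error`, the scores are negated to recover the MAE.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset standing in for the training subset
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

search_demo = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions={
        "n_estimators": [10, 20],
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 5],
    },
    n_iter=4,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=42,
)
search_demo.fit(X_demo, y_demo)

# One row per sampled candidate; scores are negative MAE, so flip the sign
results = pd.DataFrame(search_demo.cv_results_)
results["mean_mae"] = -results["mean_test_score"]
summary = results.sort_values("rank_test_score")[
    ["params", "mean_mae", "std_test_score", "rank_test_score"]
]
print(summary.to_string(index=False))
```

Applied to `random_search`, the same table would show whether the best candidate is clearly ahead or whether several settings perform about the same on the subset.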


In [50]:
# Best model
rf_model8 = RandomForestRegressor(
    n_estimators=200,
    min_samples_leaf=1,
    min_samples_split=5,
    max_features=0.5,
    max_depth=40
)

# Train model on full training dataset
rf_model8.fit(X_new_train, Y_new_train)

In [52]:
# Making predictions on both the training and validation data
Y_pred_val8 = rf_model8.predict(X_val)
Y_pred_train8 = rf_model8.predict(X_new_train)


In [53]:
# Calculating different metrics
evaluate_model(Y_pred_val8, Y_pred_train8, 8)

Training Model 8 Tests

The goal MAE and RMSE for TRAINING DATA is between 101397.28653883375 and 202794.5730776675
The goal MAE and RMSE for VALIDATION DATA is between 101075.32461734005 and 202150.6492346801

Mean Absolute Error (VALIDATION DATA): 170998.15581891214
Mean Absolute Error (TRAINING DATA): 83396.21140844488

Mean Squared Error (VALIDATION DATA): 366129324242.4206
Mean Squared Error (TRAINING DATA): 234260796885.90878

Root Mean Squared Error (VALIDATION DATA): 605086.2122395623
Root Mean Squared Error (TRAINING DATA): 484004.95543528144

R-squared (VAL DATA): 0.7701114486064199
R-squared (TRAINING DATA): 0.8990985957202537


The model did slightly better than Model 5; however, the errors are still high. I will try RandomizedSearchCV one more time using another set of parameters.

### Training Model 9 (RandomizedSearchCV)

In [19]:
# Initialize list of hyperparameters to try
param_list3 = {
    'n_estimators': [200, 300],
    'max_depth': [20, 30, 40],
    'min_samples_split': [5, 10, 20],
    'min_samples_leaf': [2, 5, 10],
    'max_features': [0.5, 'sqrt', 'log2']
}

In [14]:
# Initialize RandomizedSearchCV
random_search2 = RandomizedSearchCV(
    estimator=rf_model_default,
    param_distributions=param_list3,
    n_iter=10,            # Number of random combos to try
    cv=3,                 
    scoring='neg_mean_absolute_error',
    n_jobs=-1,            # Use all CPU cores
    verbose=2,
    random_state=42       # For reproducibility
)

In [15]:
# Fit to subset of training data
random_search2.fit(X_new_train_subset, Y_new_train_subset)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END max_depth=40, max_features=log2, min_samples_leaf=10, min_samples_split=10, n_estimators=200; total time=  10.1s
[CV] END max_depth=40, max_features=log2, min_samples_leaf=10, min_samples_split=10, n_estimators=200; total time=  10.2s
[CV] END max_depth=40, max_features=log2, min_samples_leaf=10, min_samples_split=10, n_estimators=200; total time=  10.5s
[CV] END max_depth=40, max_features=sqrt, min_samples_leaf=2, min_samples_split=20, n_estimators=300; total time=  21.5s
[CV] END max_depth=40, max_features=sqrt, min_samples_leaf=2, min_samples_split=20, n_estimators=300; total time=  21.6s
[CV] END max_depth=40, max_features=sqrt, min_samples_leaf=2, min_samples_split=20, n_estimators=300; total time=  22.2s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=2, min_samples_split=20, n_estimators=200; total time=  12.1s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=2, min_samples_split=20, n_est



[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=5, min_samples_split=20, n_estimators=300; total time=  21.1s
[CV] END max_depth=30, max_features=0.5, min_samples_leaf=2, min_samples_split=5, n_estimators=300; total time=  57.6s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=5, min_samples_split=20, n_estimators=300; total time=  19.2s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=5, min_samples_split=20, n_estimators=300; total time=  19.3s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=5, min_samples_split=20, n_estimators=300; total time=  22.0s
[CV] END max_depth=30, max_features=0.5, min_samples_leaf=2, min_samples_split=5, n_estimators=300; total time=  59.5s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=5, min_samples_split=20, n_estimators=300; total time=  19.6s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=10, min_samples_split=10, n_estimators=300; total time=  17.6s
[CV] END max_depth=20, max_features

In [16]:
# Best params found
print("Best parameters:", random_search2.best_params_)

Best parameters: {'n_estimators': 300, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 0.5, 'max_depth': 30}


In [49]:
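# Model built from the best parameters found above
# (n_estimators is set to 250 here, although the search reported 300)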
rf_model9 = RandomForestRegressor(
    n_estimators=250,
    min_samples_leaf=2,
    min_samples_split=5,
    max_features=0.5,
    max_depth=30,
    random_state = 42
)

# Train model on full training dataset
rf_model9.fit(X_new_train, Y_new_train)

In [50]:
# Making predictions with the model on both the validation data and the training data
Y_pred_val9 = rf_model9.predict(X_val)
Y_pred_train9 = rf_model9.predict(X_new_train)

In [57]:
# Calculating different metrics
evaluate_model(Y_pred_val9, Y_pred_train9, 9)

Training Model 9 Tests

The goal MAE and RMSE for TRAINING DATA is between 101397.28653883375 and 202794.5730776675
The goal MAE and RMSE for VALIDATION DATA is between 101075.32461734005 and 202150.6492346801

Mean Absolute Error (VALIDATION DATA): 170319.37498336943
Mean Absolute Error (TRAINING DATA): 96815.72348568776

Mean Squared Error (VALIDATION DATA): 352777550606.6239
Mean Squared Error (TRAINING DATA): 432764494647.6736

Root Mean Squared Error (VALIDATION DATA): 593950.7981361957
Root Mean Squared Error (TRAINING DATA): 657848.382720269

R-squared (VAL DATA): 0.7784948795321438
R-squared (TRAINING DATA): 0.8135985798185782


Based on the results, the model generalizes a little better to the validation data. The training data's metrics are worse because overfitting was reduced. I will move on for now and possibly return later to adjust it, since the RMSE is the only metric far outside the goal range.
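RMSE squares each error before averaging, so a handful of very large misses can dominate it while MAE moves much less. The sketch below illustrates this with made-up error values (they are not taken from this dataset), which is one plausible reason the RMSE sits so far outside the goal range while the MAE does not.

```python
import math


def mae(errors):
    """Mean absolute error of a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical errors: 99 typical misses of $10,000 ...
typical_errors = [10_000] * 99
# ... plus a single extreme miss of $5,000,000
with_outlier = typical_errors + [5_000_000]

print(f"Typical only -> MAE: {mae(typical_errors):.0f}, RMSE: {rmse(typical_errors):.0f}")
print(f"With outlier -> MAE: {mae(with_outlier):.0f}, RMSE: {rmse(with_outlier):.0f}")
```

In this toy case one extreme error raises the MAE by a factor of about six, while the RMSE grows by a factor of about fifty.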

### Handling Outliers

In [12]:
# Choosing cutoffs 
lower_percentile = 1
upper_percentile = 99

# Calculating cutoff values
lower_cutoff = np.percentile(Y, lower_percentile)
upper_cutoff = np.percentile(Y, upper_percentile)

# Keeping only rows of Y within these cutoffs
filtered_dataset = (Y >= lower_cutoff) & (Y <= upper_cutoff)
X_filtered = X[filtered_dataset]
Y_filtered = Y[filtered_dataset]

print(f"Original dataset size: {X.shape[0]} rows")
print(f"Cleaned dataset size: {X_filtered.shape[0]} rows")

Original dataset size: 504757 rows
Cleaned dataset size: 494816 rows
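The cutoff removed 9,941 rows, roughly 2% of the dataset. As a small illustration of the same idea, the sketch below applies a 1st/99th percentile filter to a made-up list of prices in plain Python. The `percentile` helper is hypothetical and uses linear interpolation, which is the default convention of `np.percentile`.

```python
def percentile(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up prices: 10,000, 20,000, ..., 1,010,000 (101 values)
prices = [i * 10_000 for i in range(1, 102)]

lower_cutoff = percentile(prices, 1)
upper_cutoff = percentile(prices, 99)

# Keep only the values inside the cutoffs (inclusive on both ends)
kept = [p for p in prices if lower_cutoff <= p <= upper_cutoff]

print(f"Cutoffs: {lower_cutoff} to {upper_cutoff}")
print(f"Kept {len(kept)} of {len(prices)} values")
```

Because the filter in the notebook is applied to `Y` before splitting, the filtered test set (`X_test2`, `Y_test2`) also excludes these extremes.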


In [13]:
# Splitting datasets again

# 80% Training, 20% Testing
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X_filtered, Y_filtered, test_size = 0.2, random_state = 42)

# 80% new_train, 20% val (64% new_train, 16% val, 20% testing)
X_new_train2, X_val2, Y_new_train2, Y_val2 = train_test_split(X_train2, Y_train2, test_size=0.2, random_state=42)


# Checking number of rows
print(f"Training set size: {X_train2.shape[0]} rows")
print(f"Testing set size: {X_test2.shape[0]} rows")

Training set size: 395852 rows
Testing set size: 98964 rows


In [14]:
# Recalculating for new dataset
# The target MAE is 10-20% of the mean of Y
# RMSE (the square root of MSE) is easier to interpret than MSE; its target is also 10-20% of the mean

mean_y_new_train2 = Y_new_train2.mean()
mean_y_val2 = Y_val2.mean()
mean_y_test2 = Y_test2.mean()

print(f"Y_new_train2's mean is: {mean_y_new_train2}\n")
print(f"Y_val2's mean is: {mean_y_val2}\n")
print(f"Y_test2's mean is: {mean_y_test2}\n")

goal_new_train2 = [mean_y_new_train2 * 0.1, mean_y_new_train2 * 0.2]
goal_val2 = [mean_y_val2 * 0.1, mean_y_val2 * 0.2]
goal_test2 = [mean_y_test2*0.1, mean_y_test2*0.2]

print(f"The goal MAE/RMSE for TRAINED DATA is between {goal_new_train2[0]} and {goal_new_train2[1]}\n")
print(f"The goal MAE/RMSE for VALIDATION DATA is between {goal_val2[0]} and {goal_val2[1]}\n")
print(f"The goal MAE/RMSE for TEST DATA is between {goal_test2[0]} and {goal_test2[1]}\n")

Y_new_train2's mean is: 939818.5591323128

Y_val2's mean is: 938325.6334950927

Y_test2's mean is: 937454.3253655875

The goal MAE/RMSE for TRAINED DATA is between 93981.85591323128 and 187963.71182646256

The goal MAE/RMSE for VALIDATION DATA is between 93832.56334950928 and 187665.12669901855

The goal MAE/RMSE for TEST DATA is between 93745.43253655876 and 187490.86507311751



In [15]:
# creating a function to easily evaluate the model for future uses

def evaluate_model2(Y_pred_val0, Y_pred_train0, num):
    # Calculating different metrics

    print(f"Training Model {num} Tests\n")
    print(f"The goal MAE and RMSE for TRAINING DATA is between {goal_new_train2[0]} and {goal_new_train2[1]}")
    print(f"The goal MAE and RMSE for VALIDATION DATA is between {goal_val2[0]} and {goal_val2[1]}\n")

    # Calculate Mean Absolute Error (MAE)
    mae_val = mean_absolute_error(Y_val2, Y_pred_val0)
    mae_train = mean_absolute_error(Y_new_train2, Y_pred_train0)
    print(f"Mean Absolute Error (VALIDATION DATA): {mae_val}")
    print(f"Mean Absolute Error (TRAINING DATA): {mae_train}\n")


    # Calculate Mean Squared Error (MSE), then its square root (RMSE)
    mse_val = mean_squared_error(Y_val2, Y_pred_val0)
    mse_train = mean_squared_error(Y_new_train2, Y_pred_train0)
    print(f"Mean Squared Error (VALIDATION DATA): {mse_val}")
    print(f"Mean Squared Error (TRAINING DATA): {mse_train}\n")

    rmse_val = math.sqrt(mse_val)
    rmse_train = math.sqrt(mse_train)
    print(f"Root Mean Squared Error (VALIDATION DATA): {rmse_val}")
    print(f"Root Mean Squared Error (TRAINING DATA): {rmse_train}\n")

    # Calculate R-squared (goodness of fit)
    r2_val = r2_score(Y_val2, Y_pred_val0)
    r2_train = r2_score(Y_new_train2, Y_pred_train0)
    print(f"R-squared (VAL DATA): {r2_val}")
    print(f"R-squared (TRAINING DATA): {r2_train}")
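`evaluate_model2` repeats `evaluate_model` with different global variables. One possible alternative, sketched here in plain Python, passes the true values in explicitly so that a single helper could serve both the unfiltered and filtered datasets. The names `regression_report` and `share_of_mean` are hypothetical and not part of the notebook.

```python
import math


def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    # Total sum of squares around the mean of the true values
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_total
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2}


def share_of_mean(metric_value, y_true):
    """Express an error metric as a fraction of the mean target (goal: 0.10 to 0.20)."""
    return metric_value / (sum(y_true) / len(y_true))


# Toy values for illustration only
y_true_demo = [100.0, 200.0, 300.0]
y_pred_demo = [110.0, 190.0, 330.0]

report = regression_report(y_true_demo, y_pred_demo)
print(report)
print(f"MAE as a share of the mean: {share_of_mean(report['mae'], y_true_demo):.3f}")
```

With scikit-learn available, the metric lines could be swapped for the `mean_absolute_error`, `mean_squared_error`, and `r2_score` calls already used above.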

Next, I will try running the default model on the new dataset.

### Training Model 10 (Default Model)

In [17]:
# Initializing a model based on default parameters
rf_model_default_2 = RandomForestRegressor(random_state = 42)

In [25]:
# Fit model using training data
rf_model_default_2.fit(X_new_train2, Y_new_train2)

In [26]:
# Making predictions with the model on both the validation data and the training data

# Unfiltered
Y_pred_val_default_2_unfiltered = rf_model_default_2.predict(X_val)
Y_pred_train_default_2_unfiltered = rf_model_default_2.predict(X_new_train)

# Filtered
Y_pred_val_default_2_filtered = rf_model_default_2.predict(X_val2)
Y_pred_train_default_2_filtered = rf_model_default_2.predict(X_new_train2)

In [28]:
# Calculating different metrics

# Unfiltered
print("Unfiltered: \n")
evaluate_model(Y_pred_val_default_2_unfiltered, Y_pred_train_default_2_unfiltered, 10)

# Filtered
print("Filtered: \n")
evaluate_model2(Y_pred_val_default_2_filtered, Y_pred_train_default_2_filtered, 10)

Unfiltered: 

Training Model 10 Tests

The goal MAE and RMSE for TRAINING DATA is between 101397.28653883375 and 202794.5730776675
The goal MAE and RMSE for VALIDATION DATA is between 101075.32461734005 and 202150.6492346801

Mean Absolute Error (VALIDATION DATA): 128230.92848038416
Mean Absolute Error (TRAINING DATA): 131806.57369557815

Mean Squared Error (VALIDATION DATA): 557294853650.2103
Mean Squared Error (TRAINING DATA): 1281310673376.88

Root Mean Squared Error (VALIDATION DATA): 746521.8373565574
Root Mean Squared Error (TRAINING DATA): 1131949.942964299

R-squared (VAL DATA): 0.6500807279781926
R-squared (TRAINING DATA): 0.44811061867376734
Filtered: 

Training Model 10 Tests

The goal MAE and RMSE for TRAINING DATA is between 93981.85591323128 and 187963.71182646256
The goal MAE and RMSE for VALIDATION DATA is between 93832.56334950928 and 187665.12669901855

Mean Absolute Error (VALIDATION DATA): 132875.12942124144
Mean Absolute Error (TRAINING DATA): 49444.48146814442

Me

Based on the results, for the filtered dataset, the training metrics are much stronger than the validation metrics (a training MAE of about 49,444 against a validation MAE of about 132,875), even though the validation MAE already falls within the goal range. This suggests overfitting. On the unfiltered dataset the model performs poorly, with a high RMSE and a low R^2, because it was trained without seeing any extreme outliers. I will run RandomizedSearchCV on the model to see if I can optimize it further, since the search may not have worked well earlier due to the heavy outliers.

### Training Model 11 (RandomizedSearchCV)

In [20]:
# Initialize RandomizedSearchCV
random_search3 = RandomizedSearchCV(
    estimator=rf_model_default_2,
    param_distributions=param_list3, # initialized earlier in Model 9 
    n_iter=10,            # Number of random combos to try
    cv=3,                 
    scoring='neg_mean_absolute_error',
    n_jobs=-1,            # Use all CPU cores
    verbose=2,
    random_state=42       # For reproducibility
)

Similar to earlier models, I will take a subset of the training data for resource efficiency.

In [21]:
# Take a 20% subset of X_train2 and Y_train2

X_train_subset2, _, Y_train_subset2, _ = train_test_split(
    X_train2, Y_train2, 
    test_size=0.8,   # keep 20%, drop 80%
    random_state=42, # reproducible
)

# Split the subset: 80% new_train, 20% val
X_new_train_subset2, X_val_subset2, Y_new_train_subset2, Y_val_subset2 = train_test_split(
    X_train_subset2, Y_train_subset2, test_size=0.2, random_state=42)

# Checking number of rows
print(f"New Training set size: {X_new_train_subset2.shape[0]} rows")
print(f"Validation set size: {X_val_subset2.shape[0]} rows")

New Training set size: 63336 rows
Validation set size: 15834 rows
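The printed sizes can be checked by hand. The sketch below assumes the test share of each split is rounded up and the remainder goes to training, which reproduces the row counts printed in this notebook. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# Filtered dataset: 494,816 rows, split 80/20 into train and test
n_train2, n_test2 = split_sizes(494_816, 0.2)

# Keep 20% of the training rows for the hyperparameter search
n_subset, _ = split_sizes(n_train2, 0.8)

# Split that subset 80/20 into new_train and val
n_new_train_subset, n_val_subset = split_sizes(n_subset, 0.2)

print(f"Train/test: {n_train2} / {n_test2}")
print(f"Search subset: {n_subset}")
print(f"Subset new_train/val: {n_new_train_subset} / {n_val_subset}")
```

The search therefore fits on about 16% of the filtered training rows, which is why the final models are refit on the full `X_new_train2` afterwards.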


In [22]:
# Fit to training data
random_search3.fit(X_new_train_subset2, Y_new_train_subset2)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END max_depth=40, max_features=log2, min_samples_leaf=10, min_samples_split=10, n_estimators=200; total time=  10.9s
[CV] END max_depth=40, max_features=log2, min_samples_leaf=10, min_samples_split=10, n_estimators=200; total time=  10.9s
[CV] END max_depth=40, max_features=log2, min_samples_leaf=10, min_samples_split=10, n_estimators=200; total time=  11.2s
[CV] END max_depth=40, max_features=sqrt, min_samples_leaf=2, min_samples_split=20, n_estimators=300; total time=  20.8s
[CV] END max_depth=40, max_features=sqrt, min_samples_leaf=2, min_samples_split=20, n_estimators=300; total time=  21.0s
[CV] END max_depth=40, max_features=sqrt, min_samples_leaf=2, min_samples_split=20, n_estimators=300; total time=  19.8s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=2, min_samples_split=20, n_estimators=200; total time=  12.2s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=2, min_samples_split=20, n_est



[CV] END max_depth=30, max_features=0.5, min_samples_leaf=2, min_samples_split=5, n_estimators=300; total time=  54.5s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=5, min_samples_split=20, n_estimators=300; total time=  20.0s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=5, min_samples_split=20, n_estimators=300; total time=  17.5s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=5, min_samples_split=20, n_estimators=300; total time=  17.7s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=5, min_samples_split=20, n_estimators=300; total time=  20.7s
[CV] END max_depth=30, max_features=0.5, min_samples_leaf=2, min_samples_split=5, n_estimators=300; total time=  56.8s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=5, min_samples_split=20, n_estimators=300; total time=  18.0s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=10, min_samples_split=10, n_estimators=300; total time=  16.5s
[CV] END max_depth=20, max_features

In [23]:
# Best params found
print("Best parameters:", random_search3.best_params_)

Best parameters: {'n_estimators': 300, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 0.5, 'max_depth': 30}


In [24]:
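# Model built from the best parameters found above
# (n_estimators is set to 250 here, although the search reported 300)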
rf_model11 = RandomForestRegressor(
    n_estimators=250,
    min_samples_leaf=2,
    min_samples_split=5,
    max_features=0.5,
    max_depth=30,
    random_state=42
)

# Train model on full training dataset
rf_model11.fit(X_new_train2, Y_new_train2)

In [25]:
# Making predictions with the model on both the validation data and the training data

# Filtered
Y_pred_val11 = rf_model11.predict(X_val2)
Y_pred_train11 = rf_model11.predict(X_new_train2)

# Unfiltered
Y_pred_val11_unfiltered = rf_model11.predict(X_val)
Y_pred_train11_unfiltered = rf_model11.predict(X_new_train)

In [26]:
# Calculating different metrics

# Unfiltered
print("Unfiltered: \n")
evaluate_model(Y_pred_val11_unfiltered, Y_pred_train11_unfiltered, 11)

# Filtered
print("Filtered: \n")
evaluate_model2(Y_pred_val11, Y_pred_train11, 11)

Unfiltered: 

Training Model 11 Tests

The goal MAE and RMSE for TRAINING DATA is between 101397.28653883375 and 202794.5730776675
The goal MAE and RMSE for VALIDATION DATA is between 101075.32461734005 and 202150.6492346801

Mean Absolute Error (VALIDATION DATA): 144314.38845966192
Mean Absolute Error (TRAINING DATA): 147618.77236896186

Mean Squared Error (VALIDATION DATA): 580264574677.3638
Mean Squared Error (TRAINING DATA): 1305128353899.981

Root Mean Squared Error (VALIDATION DATA): 761750.9925673637
Root Mean Squared Error (TRAINING DATA): 1142422.1434741104

R-squared (VAL DATA): 0.6356582943100534
R-squared (TRAINING DATA): 0.4378518069416545
Filtered: 

Training Model 11 Tests

The goal MAE and RMSE for TRAINING DATA is between 93981.85591323128 and 187963.71182646256
The goal MAE and RMSE for VALIDATION DATA is between 93832.56334950928 and 187665.12669901855

Mean Absolute Error (VALIDATION DATA): 132093.05023195042
Mean Absolute Error (TRAINING DATA): 72901.45628058075

M

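The model performed slightly better than the default model on validation MAE (about 132,093 against 132,875), but the gap to the training MAE (about 72,901) shows that it is still overfitting and has trouble generalizing to new data. I will adjust the parameter list to constrain the model further, pushing it toward underfitting.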

### Training Model 11 (Archived)

In [42]:
# Initializing a model with default parameters, apart from n_estimators
rf_model11 = RandomForestRegressor(
    n_estimators = 200, # increased from 100 to 200
    random_state = 42
)

In [43]:
# Fit model using training data
rf_model11.fit(X_new_train2, Y_new_train2)

In [51]:
# Making predictions with the model on both the validation data and the training data

# Filtered
Y_pred_val_11 = rf_model11.predict(X_val2)
Y_pred_train_11 = rf_model11.predict(X_new_train2)

# Unfiltered
Y_pred_val11_unfiltered = rf_model11.predict(X_val)
Y_pred_train11_unfiltered = rf_model11.predict(X_new_train)

In [52]:
# Calculating different metrics

# Filtered
print("Filtered: \n")
evaluate_model2(Y_pred_val_11, Y_pred_train_11, 11)

# Unfiltered
print("Unfiltered: \n")
evaluate_model(Y_pred_val11_unfiltered, Y_pred_train11_unfiltered, 11)

Filtered: 

Training Model 11 Tests

The goal MAE and RMSE for TRAINING DATA is between 93981.85591323128 and 187963.71182646256
The goal MAE and RMSE for VALIDATION DATA is between 93832.56334950928 and 187665.12669901855

Mean Absolute Error (VALIDATION DATA): 42848.66603414911
Mean Absolute Error (TRAINING DATA): 15850.164963119276

Mean Squared Error (VALIDATION DATA): 8651130967.81139
Mean Squared Error (TRAINING DATA): 1140460563.3525832

Root Mean Squared Error (VALIDATION DATA): 93011.4561105856
Root Mean Squared Error (TRAINING DATA): 33770.70569817254

R-squared (VAL DATA): 0.9871598042551616
R-squared (TRAINING DATA): 0.9983285561540423
Unfiltered: 

Training Model 11 Tests

The goal MAE and RMSE for TRAINING DATA is between 101397.28653883375 and 202794.5730776675
The goal MAE and RMSE for VALIDATION DATA is between 101075.32461734005 and 202150.6492346801

Mean Absolute Error (VALIDATION DATA): 67696.52228936285
Mean Absolute Error (TRAINING DATA): 71313.31590677229

Mean 

The model performed only slightly better, lowering the RMSE by a few hundred, while R^2 did not change significantly. I think it is important to make the model generalize at this point, especially since outliers were filtered out. I will try to force the model to generalize more by running RandomizedSearchCV over max_depth, min_samples_leaf, and max_features.

### Training Model 12 (RandomizedSearchCV)

In [27]:
# Initialize list of hyperparameters to try
param_list4 = {
    'n_estimators': [200, 250],
    'max_depth': [15, 20, 30],
    'min_samples_split': [5, 10, 20],
    'min_samples_leaf': [2, 5, 10, 20],
    'max_features': [0.3, 0.5, 'sqrt']
}

In [28]:
# Initialize RandomizedSearchCV
random_search4 = RandomizedSearchCV(
    estimator=rf_model11,
    param_distributions=param_list4, # initialized in the previous cell
    n_iter=10,            # Number of random combos to try
    cv=3,                 
    scoring='neg_mean_absolute_error',
    n_jobs=-1,            # Use all CPU cores
    verbose=2,
    random_state=42       # For reproducibility
)

In [29]:
# Fit to training data
random_search4.fit(X_new_train_subset2, Y_new_train_subset2)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=20, min_samples_split=5, n_estimators=200; total time=  10.4s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=20, min_samples_split=5, n_estimators=200; total time=  10.5s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=20, min_samples_split=10, n_estimators=250; total time=  12.9s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=20, min_samples_split=10, n_estimators=250; total time=  13.1s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=20, min_samples_split=10, n_estimators=250; total time=  13.3s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=5, min_samples_split=10, n_estimators=250; total time=  16.3s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=5, min_samples_split=10, n_estimators=250; total time=  16.3s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=5, min_samples_split=10, n_est

In [30]:
# Best params found
print("Best parameters:", random_search4.best_params_)

Best parameters: {'n_estimators': 250, 'min_samples_split': 10, 'min_samples_leaf': 5, 'max_features': 0.5, 'max_depth': 30}
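With `n_iter=10`, the search only tries a small slice of `param_list4`. The sketch below counts every combination in the grid and draws 10 of them without replacement using the standard library. It illustrates the idea of random search only and does not reproduce the exact candidates that scikit-learn selected.

```python
import itertools
import random

# Same grid as param_list4 above
param_list4 = {
    'n_estimators': [200, 250],
    'max_depth': [15, 20, 30],
    'min_samples_split': [5, 10, 20],
    'min_samples_leaf': [2, 5, 10, 20],
    'max_features': [0.3, 0.5, 'sqrt'],
}

# Enumerate every possible combination of the listed values
names = list(param_list4)
all_combos = [
    dict(zip(names, values))
    for values in itertools.product(*param_list4.values())
]

# Draw 10 distinct combinations, as a random search with n_iter=10 would
rng = random.Random(42)
sampled = rng.sample(all_combos, 10)

coverage = len(sampled) / len(all_combos)
print(f"Grid size: {len(all_combos)} combinations")
print(f"Sampled: {len(sampled)} ({coverage:.1%} of the grid)")
```

Since fewer than 5% of the combinations are evaluated, the reported best parameters are the best of the sampled candidates, not necessarily the best in the grid.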


In [31]:
# Best Model
rf_model12 = RandomForestRegressor(
    n_estimators=250,
    min_samples_split=10,
    min_samples_leaf=5,
    max_features=0.5,
    max_depth=30, 
    random_state = 42
)

# Train model on full training dataset
rf_model12.fit(X_new_train2, Y_new_train2)

In [32]:
# Making predictions with the model on both the validation data and the training data

# Filtered
Y_pred_val12 = rf_model12.predict(X_val2)
Y_pred_train12 = rf_model12.predict(X_new_train2)

# Unfiltered
Y_pred_val12_unfiltered = rf_model12.predict(X_val)
Y_pred_train12_unfiltered = rf_model12.predict(X_new_train)

In [33]:
# Calculating different metrics

# Unfiltered
print("Unfiltered: \n")
evaluate_model(Y_pred_val12_unfiltered, Y_pred_train12_unfiltered, 12)

# Filtered
print("Filtered: \n")
evaluate_model2(Y_pred_val12, Y_pred_train12, 12)

Unfiltered: 

Training Model 12 Tests

The goal MAE and RMSE for TRAINING DATA is between 101397.28653883375 and 202794.5730776675
The goal MAE and RMSE for VALIDATION DATA is between 101075.32461734005 and 202150.6492346801

Mean Absolute Error (VALIDATION DATA): 163895.69091801514
Mean Absolute Error (TRAINING DATA): 167401.32519389808

Mean Squared Error (VALIDATION DATA): 603044997197.3063
Mean Squared Error (TRAINING DATA): 1327956316313.4

Root Mean Squared Error (VALIDATION DATA): 776559.7190154189
Root Mean Squared Error (TRAINING DATA): 1152369.8695789473

R-squared (VAL DATA): 0.621354719079964
R-squared (TRAINING DATA): 0.42801928910265363
Filtered: 

Training Model 12 Tests

The goal MAE and RMSE for TRAINING DATA is between 93981.85591323128 and 187963.71182646256
The goal MAE and RMSE for VALIDATION DATA is between 93832.56334950928 and 187665.12669901855

Mean Absolute Error (VALIDATION DATA): 135683.20343193001
Mean Absolute Error (TRAINING DATA): 100983.16910668983

Me

The model performed worse than Model 11. I will try to push it toward underfitting by setting the values manually instead of using RandomizedSearchCV.

### Training Model 13

In [35]:
# Initializing model
rf_model13 = RandomForestRegressor(
    n_estimators=250,  # Constant
    max_depth=20,      # Decreasing from 30 to 20
    min_samples_leaf=8, # Increasing from 5 to 8
    min_samples_split=10, # Constant
    max_features=0.4,   # Decreasing from 0.5 to 0.4
    random_state=42,       # Ensure reproducibility
    n_jobs=-1              # Use all available cores
)

In [36]:
# Fit model using training data
rf_model13.fit(X_new_train2, Y_new_train2)

In [38]:
# Making predictions with the model on both the validation data and the training data

# Filtered
Y_pred_val13_filtered = rf_model13.predict(X_val2)
Y_pred_train13_filtered = rf_model13.predict(X_new_train2)

# Unfiltered
Y_pred_val13_unfiltered = rf_model13.predict(X_val)
Y_pred_train13_unfiltered = rf_model13.predict(X_new_train)

In [39]:
# Calculating different metrics

# Unfiltered
print("Unfiltered: \n")
evaluate_model(Y_pred_val13_unfiltered, Y_pred_train13_unfiltered, 13)

# Filtered
print("Filtered: \n")
evaluate_model2(Y_pred_val13_filtered, Y_pred_train13_filtered, 13)

Unfiltered: 

Training Model 13 Tests

The goal MAE and RMSE for TRAINING DATA is between 101397.28653883375 and 202794.5730776675
The goal MAE and RMSE for VALIDATION DATA is between 101075.32461734005 and 202150.6492346801

Mean Absolute Error (VALIDATION DATA): 187139.5550425616
Mean Absolute Error (TRAINING DATA): 190678.05188213818

Mean Squared Error (VALIDATION DATA): 641371156025.1105
Mean Squared Error (TRAINING DATA): 1366510922364.9001

Root Mean Squared Error (VALIDATION DATA): 800856.514005543
Root Mean Squared Error (TRAINING DATA): 1168978.5807981684

R-squared (VAL DATA): 0.5972901480390209
R-squared (TRAINING DATA): 0.41141295144922463
Filtered: 

Training Model 13 Tests

The goal MAE and RMSE for TRAINING DATA is between 93981.85591323128 and 187963.71182646256
The goal MAE and RMSE for VALIDATION DATA is between 93832.56334950928 and 187665.12669901855

Mean Absolute Error (VALIDATION DATA): 147398.80890600727
Mean Absolute Error (TRAINING DATA): 128051.71096641905



The model now leans toward underfitting, but it still performs worse than Model 12. I will adjust a few more parameters.

### Training Model 14

In [40]:
# Initializing model
rf_model14 = RandomForestRegressor(
    n_estimators=250,  # Constant
    max_depth=25,      # Increasing from 20 to 25
    min_samples_leaf=6, # Decreasing from 8 to 6
    min_samples_split=10, # Constant
    max_features=0.5,   # Increasing from 0.4 to 0.5
    random_state=42,       # Ensure reproducibility
    n_jobs=-1              # Use all available cores
)

In [41]:
# Fit model using training data
rf_model14.fit(X_new_train2, Y_new_train2)

In [42]:
# Making predictions with the model on both the validation data and the training data

# Filtered
Y_pred_val14_filtered = rf_model14.predict(X_val2)
Y_pred_train14_filtered = rf_model14.predict(X_new_train2)

# Unfiltered
Y_pred_val14_unfiltered = rf_model14.predict(X_val)
Y_pred_train14_unfiltered = rf_model14.predict(X_new_train)

In [43]:
# Calculating different metrics

# Unfiltered
print("Unfiltered: \n")
evaluate_model(Y_pred_val14_unfiltered, Y_pred_train14_unfiltered, 14)

# Filtered
print("Filtered: \n")
evaluate_model2(Y_pred_val14_filtered, Y_pred_train14_filtered, 14)

Unfiltered: 

Training Model 14 Tests

The goal MAE and RMSE for TRAINING DATA is between 101397.28653883375 and 202794.5730776675
The goal MAE and RMSE for VALIDATION DATA is between 101075.32461734005 and 202150.6492346801

Mean Absolute Error (VALIDATION DATA): 168735.03446863286
Mean Absolute Error (TRAINING DATA): 172307.68506174354

Mean Squared Error (VALIDATION DATA): 608470821699.6736
Mean Squared Error (TRAINING DATA): 1333573068856.8008

Root Mean Squared Error (VALIDATION DATA): 780045.3972043381
Root Mean Squared Error (TRAINING DATA): 1154804.3422401913

R-squared (VAL DATA): 0.6179479039128206
R-squared (TRAINING DATA): 0.42560002720883794
Filtered: 

Training Model 14 Tests

The goal MAE and RMSE for TRAINING DATA is between 93981.85591323128 and 187963.71182646256
The goal MAE and RMSE for VALIDATION DATA is between 93832.56334950928 and 187665.12669901855

Mean Absolute Error (VALIDATION DATA): 137041.5818663775
Mean Absolute Error (TRAINING DATA): 107613.99248599046


The model did better than Model 13, but it still performed worse than Model 11 overall. I think Model 11 is the best model so far.
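The filtered-data MAE values printed above for Models 10 to 14 can be lined up to make the comparison explicit. The sketch below copies those values (rounded to two decimals) and computes the ratio of validation MAE to training MAE as a rough overfitting indicator. The ratio is an illustrative choice, not a metric used elsewhere in this notebook, and the archived Model 11 run is left out.

```python
# (validation MAE, training MAE) on the filtered dataset, from the outputs above
filtered_mae = {
    10: (132875.13, 49444.48),
    11: (132093.05, 72901.46),
    12: (135683.20, 100983.17),
    13: (147398.81, 128051.71),
    14: (137041.58, 107613.99),
}

# A ratio near 1 means training and validation errors are similar
gap_ratio = {
    model: val_mae / train_mae
    for model, (val_mae, train_mae) in filtered_mae.items()
}

for model, (val_mae, train_mae) in filtered_mae.items():
    print(f"Model {model}: val MAE {val_mae:,.2f}, "
          f"train MAE {train_mae:,.2f}, ratio {gap_ratio[model]:.2f}")

# Model with the lowest validation MAE
best_model = min(filtered_mae, key=lambda model: filtered_mae[model][0])
print(f"Lowest validation MAE: Model {best_model}")
```

Model 11 has the lowest validation MAE, while Models 12 to 14 narrow the train/validation gap at the cost of a higher validation error.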

## Final Model Evaluation

The best models were Training Model 9 for the unfiltered dataset and Training Model 11 for the filtered dataset. I will use both models to predict on the test sets of both the unfiltered and filtered datasets, which gives 4 sets of metrics in total.

In [53]:
# Model 9

# Making prediction 
Y_pred_test_unfiltered_9 = rf_model9.predict(X_test)
Y_pred_test_filtered_9 = rf_model9.predict(X_test2)


# Model 11

# Unfiltered
Y_pred_test_unfiltered_11 = rf_model11.predict(X_test)

# Filtered
Y_pred_test_filtered_11 = rf_model11.predict(X_test2)


In [66]:
# Define a function to calculate metrics

def metrics(y_test, y_pred_test, dataset_name):

    # Calculate metrics
    mae = mean_absolute_error(y_test, y_pred_test)
    mse = mean_squared_error(y_test, y_pred_test)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred_test)

    # Filter out zero or negative ClosePrice
    filtered = y_test > 0
    mape = mean_absolute_percentage_error(y_test[filtered], y_pred_test[filtered]) * 100

    

    # Print out metrics

    print(f"{dataset_name}: \n")
    print(f"MAE: {mae:.2f}")
    print(f"MSE: {mse:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print(f"MAPE: {mape:.2f}%")
    print(f"R^2: {r2:.2f}\n")
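MAPE divides each absolute error by the true value, so rows with a true value of zero have to be excluded first, which is what the `filtered` mask in `metrics` does. The sketch below shows the same masking in plain Python on made-up values. `mape_percent` is a hypothetical helper, not the scikit-learn function.

```python
def mape_percent(y_true, y_pred):
    """Mean absolute percentage error (in %), skipping non-positive true values."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t > 0]
    if not pairs:
        raise ValueError("No positive true values to evaluate")
    return 100 * sum(abs(t - p) / t for t, p in pairs) / len(pairs)


# Made-up close prices; the first row has a true value of 0 and is skipped
y_true_demo = [0.0, 200.0, 400.0]
y_pred_demo = [50.0, 180.0, 440.0]

demo_mape = mape_percent(y_true_demo, y_pred_demo)
print(f"MAPE: {demo_mape:.2f}%")
```

Both remaining rows are off by 10% of their true value, so the masked MAPE here is 10%.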


In [67]:
# Printing out metrics

# Model 9 (Unfiltered)

print("Unfiltered Model")
print("Training Model 9 Metrics:\n")
print("This model was tuned using RandomizedSearchCV. It was trained on the unfiltered dataset.\n")

# Calculating metrics

# Unfiltered
metrics(Y_test, Y_pred_test_unfiltered_9, "Unfiltered")

# Filtered
metrics(Y_test2, Y_pred_test_filtered_9, "Filtered")

print("-------------------------------")

# Model 11 (Filtered)

print("Filtered Model")
print("Training Model 11 Metrics:\n")
print("This model was tuned using RandomizedSearchCV. It was trained on the filtered dataset.\n")

# Calculating metrics

# Unfiltered
metrics(Y_test, Y_pred_test_unfiltered_11, "Unfiltered")

# Filtered
metrics(Y_test2, Y_pred_test_filtered_11, "Filtered")

Unfiltered Model
Training Model 9 Metrics:

This model was tuned using RandomizedSearchCV. It was trained on the unfiltered dataset.

Unfiltered: 

MAE: 174116.97
MSE: 671346340776.57
RMSE: 819357.27
MAPE: 32.38%
R^2: 0.66

Filtered: 

MAE: 101969.91
MSE: 59175628338.86
RMSE: 243260.41
MAPE: 18.21%
R^2: 0.91

-------------------------------
Filtered Model
Training Model 11 Metrics:

This model was tuned using RandomizedSearchCV. It was trained on the filtered dataset.

Unfiltered: 

MAE: 149706.66
MSE: 976268690830.44
RMSE: 988063.10
MAPE: 22.00%
R^2: 0.51

Filtered: 

MAE: 131795.71
MSE: 66015261169.45
RMSE: 256934.35
MAPE: 21.11%
R^2: 0.90



Note: On the unfiltered test set, Model 9's RMSE is high and its R^2 is lower, possibly because that test set happens to contain extreme outliers.