# Notebook 2: Model Building, Evaluation, and Comparison

## 1. Introduction (Notebook 2)
### Project Overview
This notebook builds on the exploratory data analysis from the preceding notebook. Here we develop, evaluate, and compare several supervised machine learning regression models for predicting house prices. By comparing the error metrics of different algorithms, we aim to identify the most effective approach for this task and lay the groundwork for a practical algorithm selection guide, to be published separately.
### Data Description
The dataset utilized in this analysis, the Melbourne Housing Snapshot, provides detailed information on residential properties sold in Melbourne, Australia. Features include square footage, number of bedrooms and bathrooms, year built, lot size, address, property type, and other pertinent characteristics. The target variable is the sale price of each house. This [dataset](https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot), obtained from [Kaggle](https://www.kaggle.com/), offers a valuable resource for examining the impact of diverse property attributes on market value. It encompasses both numerical and categorical data types, enabling a comprehensive analysis. This is the same dataset that was explored and prepared in the previous Notebook.
### Objective:
The goals of this notebook are:
+ To train and evaluate a variety of supervised machine learning regression models for house price prediction.
+ To utilize appropriate error metrics to assess and compare the predictive performance of each model.
+ To document the performance of each algorithm, providing a clear basis for comparison and analysis in the culminating Medium article.

## Set up and Data Loading

In [2]:
import gradio as gr
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import pickle  # To save and load trained models

In [3]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
from IPython.display import display, HTML

display(HTML("<script>Jupyter.notebook.config.update({'max_output_size': 50000})</script>"))

In [5]:
%cd ..

/Users/marcelgrossmann/Documents/Project/ml_foundation/ml_foundation


In [6]:
%cd ml_foundation/

[Errno 2] No such file or directory: 'ml_foundation/'
/Users/marcelgrossmann/Documents/Project/ml_foundation/ml_foundation


In [7]:
path = "dataset/regression_melbourne_house_data.csv"
# Load the prepared data
data = pd.read_csv(path)

print("Prepared Data Loaded for Modeling:")

print(f"The data has {data.shape[0]} rows (houses) and {data.shape[1]} columns (information about each house).")
data.head()

Prepared Data Loaded for Modeling:
The data has 13508 rows (houses) and 257 columns (information about each house).


Unnamed: 0,Rooms,Price,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,Lattitude,Longtitude,...,Postcode_3806.0,Postcode_3807.0,Postcode_3808.0,Postcode_3809.0,Postcode_3810.0,Postcode_3910.0,Postcode_3976.0,Postcode_3977.0,Type_t,Type_u
0,2,1480000.0,2.5,2.0,1.0,1.0,202.0,126.0,-37.7996,144.9984,...,0,0,0,0,0,0,0,0,0,0
1,2,1035000.0,2.5,2.0,1.0,0.0,156.0,79.0,-37.8079,144.9934,...,0,0,0,0,0,0,0,0,0,0
2,3,1465000.0,2.5,3.0,2.0,0.0,134.0,150.0,-37.8093,144.9944,...,0,0,0,0,0,0,0,0,0,0
3,3,850000.0,2.5,3.0,2.0,1.0,94.0,126.0,-37.7969,144.9969,...,0,0,0,0,0,0,0,0,0,0
4,4,1600000.0,2.5,3.0,1.0,2.0,120.0,142.0,-37.8072,144.9941,...,0,0,0,0,0,0,0,0,0,0


We drop highly correlated features, as indicated by the heatmap in the [data analysis notebook](https://github.com/marcndo/ml_foundation/blob/main/EDA/Melbourne_data_analysis.ipynb). In particular, the features 'Bedroom2' and 'Rooms' are highly correlated, so 'Rooms' is the one we drop.

In [8]:
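# drop() returns a copy; assign it (data = data.drop('Rooms', axis=1)) to keep the change, otherwise 'Rooms' stays in `data`.
data.drop('Rooms', axis=1)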

Unnamed: 0,Price,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,Lattitude,Longtitude,Propertycount,...,Postcode_3806.0,Postcode_3807.0,Postcode_3808.0,Postcode_3809.0,Postcode_3810.0,Postcode_3910.0,Postcode_3976.0,Postcode_3977.0,Type_t,Type_u
0,1480000.0,2.5,2.0,1.0,1.0,202.0,126.0,-37.79960,144.99840,4019.0,...,0,0,0,0,0,0,0,0,0,0
1,1035000.0,2.5,2.0,1.0,0.0,156.0,79.0,-37.80790,144.99340,4019.0,...,0,0,0,0,0,0,0,0,0,0
2,1465000.0,2.5,3.0,2.0,0.0,134.0,150.0,-37.80930,144.99440,4019.0,...,0,0,0,0,0,0,0,0,0,0
3,850000.0,2.5,3.0,2.0,1.0,94.0,126.0,-37.79690,144.99690,4019.0,...,0,0,0,0,0,0,0,0,0,0
4,1600000.0,2.5,3.0,1.0,2.0,120.0,142.0,-37.80720,144.99410,4019.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13503,1245000.0,16.7,4.0,2.0,2.0,652.0,126.0,-37.90562,145.16761,7392.0,...,0,0,0,0,0,0,0,0,0,0
13504,1031000.0,6.8,3.0,2.0,2.0,333.0,133.0,-37.85927,144.87904,6380.0,...,0,0,0,0,0,0,0,0,0,0
13505,1170000.0,6.8,3.0,2.0,4.0,436.0,126.0,-37.85274,144.88738,6380.0,...,0,0,0,0,0,0,0,0,0,0
13506,2500000.0,6.8,4.0,1.0,5.0,866.0,157.0,-37.85908,144.89299,6380.0,...,0,0,0,0,0,0,0,0,0,0
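
`DataFrame.drop` returns a new frame rather than modifying the original, so the result has to be assigned back for the column to stay dropped. A minimal sketch of that behaviour, using a toy frame built from the first five rows shown earlier (the frame name `toy` is only for illustration):

```python
import pandas as pd

# Toy frame: the first five rows of three columns from the dataset preview.
toy = pd.DataFrame({
    "Rooms": [2, 2, 3, 3, 4],
    "Bedroom2": [2.0, 2.0, 3.0, 3.0, 3.0],
    "Price": [1480000.0, 1035000.0, 1465000.0, 850000.0, 1600000.0],
})

# Pearson correlation between the two candidate features.
corr = toy["Rooms"].corr(toy["Bedroom2"])

# drop() returns a new DataFrame; without assignment `toy` is unchanged.
toy.drop("Rooms", axis=1)
still_there = "Rooms" in toy.columns

# Assigning the result back makes the change stick.
toy = toy.drop(columns="Rooms")
dropped = "Rooms" not in toy.columns

print(f"correlation={corr:.3f}, still_there={still_there}, dropped={dropped}")
```

Five rows are far too few to judge correlation reliably; the heatmap in the analysis notebook remains the actual basis for the decision.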


In [9]:
# We want to predict the 'Price' of the house
y = data['Price']

# The rest of the information will help us make the prediction
X = data.drop('Price', axis=1)

print("\nInformation we'll use to predict price (features):")
print(X.head())
print(f"We have {X.shape[1]} pieces of information for each house to help us predict the price.")
print("\nWhat we want to predict (price):")
print(y.head())


Information we'll use to predict price (features):
   Rooms  Distance  Bedroom2  Bathroom  Car  Landsize  BuildingArea  \
0      2       2.5       2.0       1.0  1.0     202.0         126.0   
1      2       2.5       2.0       1.0  0.0     156.0          79.0   
2      3       2.5       3.0       2.0  0.0     134.0         150.0   
3      3       2.5       3.0       2.0  1.0      94.0         126.0   
4      4       2.5       3.0       1.0  2.0     120.0         142.0   

   Lattitude  Longtitude  Propertycount  ...  Postcode_3806.0  \
0   -37.7996    144.9984         4019.0  ...                0   
1   -37.8079    144.9934         4019.0  ...                0   
2   -37.8093    144.9944         4019.0  ...                0   
3   -37.7969    144.9969         4019.0  ...                0   
4   -37.8072    144.9941         4019.0  ...                0   

   Postcode_3807.0  Postcode_3808.0  Postcode_3809.0  Postcode_3810.0  \
0                0                0                0     

Our goal is to predict the 'Price' of a house.
+ We separate the price into a variable we call 'y'. Because the target (house price) is continuous, this is a supervised regression problem, which requires features (information about each house) and a target (the actual sale price). In our case, the target is 'y' and the rest of the columns are the features.
+ All the other information we have about each house (like number of bedrooms, land size, location) is what we'll use to make that prediction. We call this 'X'.
+ We then look at the first few rows of 'X' and 'y' to make sure things look right.

### Split the Data into Training and Testing Sets
+ We split the dataset into a training set and a test set. The training set is used to fit each machine learning algorithm so that it learns the relationship between the features and the target variable.
+ The test set is used to measure how well this relationship has been learned.
+ 20% of the dataset is reserved for testing, while 80% is used to train the algorithm.

In [10]:
# We split our data into two groups: a training set to teach the computer, and a testing set to see how well it learned.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("\nData split into training and testing:")
print(f"Training data (used to help the algorithm learn): {X_train.shape[0]} houses")
print(f"Testing data (used to measure how well the algorithm learned): {X_test.shape[0]} houses")


Data split into training and testing:
Training data (used to help the algorithm learn): 10806 houses
Testing data (used to measure how well the algorithm learned): 2702 houses


In [11]:
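# Some of our information is in very different ranges (like land size vs. number of bedrooms).
# We standardize the numerical columns (zero mean, unit variance) to make them easier for the computer to compare.
# Note: the one-hot encoded columns also have numeric dtypes, so select_dtypes picks them up and they are scaled too.
numerical_features = X_train.select_dtypes(include=['float64', 'int64']).columns.tolist()

# Fit the scaler on the training set only, then reuse its statistics on the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train[numerical_features])
X_test_scaled = scaler.transform(X_test[numerical_features])

# Work on explicit copies so the assignments below do not write into slices of X.
X_train, X_test = X_train.copy(), X_test.copy()
X_train[numerical_features] = X_train_scaled
X_test[numerical_features] = X_test_scaled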

print("\nScaled Training Data (numerical parts):")
X_train[numerical_features].head()


Scaled Training Data (numerical parts):


Unnamed: 0,Rooms,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,Lattitude,Longtitude,Propertycount,...,Postcode_3806.0,Postcode_3807.0,Postcode_3808.0,Postcode_3809.0,Postcode_3810.0,Postcode_3910.0,Postcode_3976.0,Postcode_3977.0,Type_t,Type_u
3647,0.057678,-0.769891,0.081735,-0.772553,0.411577,0.892973,-0.02497,0.141026,0.445852,0.652333,...,-0.031922,-0.013606,-0.00962,-0.00962,-0.013606,-0.021516,-0.016664,-0.02357,-0.30014,-0.530896
5836,1.119456,-0.685006,1.12697,2.093785,3.563594,0.412114,-0.034092,-0.79635,-0.18384,1.313625,...,-0.031922,-0.013606,-0.00962,-0.00962,-0.013606,-0.021516,-0.016664,-0.02357,-0.30014,-0.530896
9494,0.057678,1.114556,1.12697,0.660616,0.411577,2.579431,0.344474,-0.760647,1.723857,1.78669,...,-0.031922,-0.013606,-0.00962,-0.00962,-0.013606,-0.021516,-0.016664,-0.02357,-0.30014,-0.530896
3540,-2.065879,-1.00757,-2.008736,-0.772553,0.411577,-1.078778,-0.034092,0.248263,-0.611954,-0.499756,...,-0.031922,-0.013606,-0.00962,-0.00962,-0.013606,-0.021516,-0.016664,-0.02357,-0.30014,1.883608
5459,0.057678,-0.600121,0.081735,2.093785,0.411577,-0.556506,0.159752,0.075422,-1.01895,-1.146726,...,-0.031922,-0.013606,-0.00962,-0.00962,-0.013606,-0.021516,-0.016664,-0.02357,-0.30014,-0.530896


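Many machine learning algorithms are sensitive to feature scale: distance- and kernel-based methods such as SVR measure distances between data points, and regularized linear models such as Ridge and Lasso penalize the size of the coefficients. If we don't scale our features, those with inherently larger values (like land size compared to the number of bedrooms) can disproportionately influence the model's predictions simply because of their magnitude. This can falsely suggest that these high-scale features are more important for predicting the outcome than others, masking the true importance of features with smaller scales. Tree-based models such as random forests and gradient boosting are largely unaffected by feature scaling.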
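
To make the effect of standardization concrete, here is a minimal sketch on a tiny two-column array (land size and bedroom count). The array values and the names `train`, `test`, and `demo_scaler` are illustrative assumptions, not part of the notebook's pipeline:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values: column 0 ~ land size (m^2), column 1 ~ bedroom count.
train = np.array([[202.0, 2.0], [156.0, 2.0], [134.0, 3.0], [94.0, 3.0], [120.0, 3.0]])
test = np.array([[652.0, 4.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learn mean/std from training rows only
test_scaled = demo_scaler.transform(test)        # reuse the training statistics

# Equivalent manual computation: z = (x - mean) / std, with the population std (ddof=0).
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.round(3))
print(test_scaled.round(3))
```

After scaling, both columns have mean 0 and standard deviation 1 on the training rows, so neither dominates purely because of its units. The test row is transformed with the training statistics, which is why its values need not fall in the same range.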

### Before parameter optimization

In [21]:
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1000),
    "Lasso Regression": Lasso(alpha=10),
    "Support Vector Regression": SVR(kernel='linear',gamma='auto', C=10),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100, min_samples_split=10, min_samples_leaf=2, max_depth=None,bootstrap=True,random_state=42),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=42)
}

results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE; avoids the `squared` keyword, which newer scikit-learn releases no longer accept
    r2 = r2_score(y_test, y_pred)
    results[name] = {'RMSE': rmse, 'R-squared': r2}

print("\nModel Performance:")
for name, scores in results.items():
    print(f"{name}: RMSE = ${scores['RMSE']:.2f}, R-squared = {scores['R-squared']:.4f}")

# Select the best model based on R-squared for the Gradio interface
best_model_name = max(results, key=lambda k: results[k]['R-squared'])
best_model = models[best_model_name]

# Save the best model and scaler (the context managers close the files)
with open('best_house_price_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

print(f"\nBest model ({best_model_name}) and scaler saved.")



Model Performance:
Linear Regression: RMSE = $599681907430005120.00, R-squared = -873467188889441206796288.0000
Ridge Regression: RMSE = $369658.02, R-squared = 0.6681
Lasso Regression: RMSE = $364716.64, R-squared = 0.6769
Support Vector Regression: RMSE = $547372.91, R-squared = 0.2723
Random Forest Regressor: RMSE = $284010.78, R-squared = 0.8041
Gradient Boosting Regressor: RMSE = $305397.92, R-squared = 0.7735

Best model (Random Forest Regressor) and scaler saved.
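
For reference, the two metrics reported above can be computed by hand. The sketch below uses made-up prices (not results from this notebook) and hypothetical helper names `rmse` and `r_squared`. It also shows that R-squared turns negative whenever a model does worse than always predicting the mean, which a single runaway prediction is enough to cause. It does not diagnose why Linear Regression scored so poorly here.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when worse than predicting the mean."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative prices only.
y_true = [850_000, 1_035_000, 1_465_000, 1_600_000]
good = [900_000, 1_000_000, 1_400_000, 1_650_000]
wild = [900_000, 1_000_000, 1_400_000, 9e12]  # one runaway prediction

print(rmse(y_true, good), r_squared(y_true, good))
print(rmse(y_true, wild), r_squared(y_true, wild))
```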


In [None]:
# Define the hyperparameter search spaces for each model
param_spaces = {
    "Ridge Regression": {
        'alpha': np.logspace(-6, 6, 13),
        'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'],  # 'lbfgs' omitted: Ridge only accepts it with positive=True
    },
    "Lasso Regression": {
        'alpha': np.logspace(-6, 6, 13),
        'max_iter': [10000]
    },
    "Support Vector Regression": {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    "Random Forest Regressor": {
        'n_estimators': [100, 200, 300],
        'max_depth': [None, 5, 10, 15],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'bootstrap': [True, False]
    },
    "Gradient Boosting Regressor": {
        'n_estimators': [100, 200, 300],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [3, 5, 7],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }
}

models_to_tune = ["Ridge Regression", "Lasso Regression", "Support Vector Regression", "Random Forest Regressor", "Gradient Boosting Regressor"]
best_params = {}
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform hyperparameter tuning for each model separately
for name in models_to_tune:
    print(f"Tuning hyperparameters for {name}...")
    model = None
    search = None

    if name == "Ridge Regression":
        model = Ridge()
        search = GridSearchCV(estimator=model, param_grid=param_spaces[name], scoring='r2', cv=kf, n_jobs=-1)
    elif name == "Lasso Regression":
        model = Lasso()
        search = GridSearchCV(estimator=model, param_grid=param_spaces[name], scoring='r2', cv=kf, n_jobs=-1)
    elif name == "Support Vector Regression":
        model = SVR()
        search = RandomizedSearchCV(estimator=model, param_distributions=param_spaces[name], n_iter=10, scoring='r2', cv=kf, n_jobs=-1, random_state=42)
    elif name == "Random Forest Regressor":
        model = RandomForestRegressor(random_state=42)
        search = RandomizedSearchCV(estimator=model, param_distributions=param_spaces[name], n_iter=10, scoring='r2', cv=kf, n_jobs=-1, random_state=42)
    elif name == "Gradient Boosting Regressor":
        model = GradientBoostingRegressor(random_state=42)
        search = RandomizedSearchCV(estimator=model, param_distributions=param_spaces[name], n_iter=10, scoring='r2', cv=kf, n_jobs=-1, random_state=42)

    if search:
        search.fit(X_train_scaled, y_train)
        best_params[name] = search.best_params_
        print(f"Best parameters for {name}: {search.best_params_}")
        print(f"Best R-squared on cross-validation for {name}: {search.best_score_:.4f}\n")

print("\nOptimal Parameters for Each Model:")
for name, params in best_params.items():
    print(f"{name}: {params}")
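
The search objects hold more than `best_params_`. With the default `refit=True`, each search also retrains the winning configuration on the full training data and exposes it as `best_estimator_`, which can be scored on the held-out test set. A minimal sketch on synthetic data (the data and all `*_demo` names are assumptions for illustration, not the housing dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in data; not the housing dataset.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
search_demo = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-3, 3, 7)},
    scoring="r2",
    cv=kf_demo,
)
search_demo.fit(X_tr, y_tr)

# best_estimator_ is the winning configuration refit on all of X_tr.
tuned_demo = search_demo.best_estimator_
test_r2 = tuned_demo.score(X_te, y_te)  # R-squared on data the search never saw

print(search_demo.best_params_, round(search_demo.best_score_, 4), round(test_r2, 4))
```

The cross-validated score guides the choice of hyperparameters, while the test-set score is the figure to report when comparing models.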

### After parameter optimization

In [None]:
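# Hand-set values from the baseline cell, used as a fallback for any model
# that has no entry in `best_params` from the tuning cell above.
fallback_params = {
    "Ridge Regression": {'alpha': 1000},
    "Lasso Regression": {'alpha': 10},
    "Support Vector Regression": {'kernel': 'linear', 'gamma': 'auto', 'C': 10},
    "Random Forest Regressor": {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_depth': None, 'bootstrap': True},
    "Gradient Boosting Regressor": {}
}
tuned = {name: best_params.get(name, params) for name, params in fallback_params.items()}

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(**tuned["Ridge Regression"]),
    "Lasso Regression": Lasso(**tuned["Lasso Regression"]),
    "Support Vector Regression": SVR(**tuned["Support Vector Regression"]),
    "Random Forest Regressor": RandomForestRegressor(random_state=42, **tuned["Random Forest Regressor"]),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=42, **tuned["Gradient Boosting Regressor"])
}

results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE without the `squared` keyword
    r2 = r2_score(y_test, y_pred)
    results[name] = {'RMSE': rmse, 'R-squared': r2}

print("\nModel Performance:")
for name, scores in results.items():
    print(f"{name}: RMSE = ${scores['RMSE']:.2f}, R-squared = {scores['R-squared']:.4f}")

# Select the best model based on R-squared for the Gradio interface
best_model_name = max(results, key=lambda k: results[k]['R-squared'])
best_model = models[best_model_name]

# Save the best model and scaler (the context managers close the files)
with open('best_house_price_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

print(f"\nBest model ({best_model_name}) and scaler saved.")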
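
The saved files are only useful if the model and scaler are loaded and applied together, since new inputs must be scaled with the same statistics the model was trained on. A minimal round-trip sketch, assuming synthetic data and temporary file names (none of the `demo_*` names or paths come from the notebook):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the housing dataset.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=100)

demo_scaler = StandardScaler().fit(X_demo)
demo_model = RandomForestRegressor(n_estimators=20, random_state=42)
demo_model.fit(demo_scaler.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")    # hypothetical file names
    scaler_path = os.path.join(tmp, "scaler.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(demo_model, f)
    with open(scaler_path, "wb") as f:
        pickle.dump(demo_scaler, f)

    # Load both artifacts back, as a serving script would.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(scaler_path, "rb") as f:
        loaded_scaler = pickle.load(f)

# New rows must be scaled with the saved scaler before prediction.
new_rows = X_demo[:5]
reloaded_pred = loaded_model.predict(loaded_scaler.transform(new_rows))
original_pred = demo_model.predict(demo_scaler.transform(new_rows))
print(np.allclose(reloaded_pred, original_pred))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.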
