First we will import the regular Python libraries that we have been using.

Normally you would import everything at the top here, but for demonstration purposes we will import functions as we need them.

In [None]:
# IMPORTS

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# # libraries and functions used throughout:

# from sklearn.model_selection import train_test_split

# from sklearn.pipeline import Pipeline
# from sklearn.impute import SimpleImputer
# from sklearn.preprocessing import StandardScaler, OneHotEncoder
# from sklearn.compose import ColumnTransformer

# from sklearn.ensemble import RandomForestRegressor

# from sklearn.metrics import mean_squared_error, r2_score

# from sklearn.model_selection import GridSearchCV

Now, we will load the data.

This dataset is from: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

The dataset comes as two files. The training file, 'train.csv', includes the target variable "SalePrice" (the dependent variable that we want to predict), while 'test.csv' does not. So we load them as separate dataframes.

In [None]:
# LOAD DATA

folder = 'house-prices-advanced-regression-techniques/'
fn_train = 'train.csv'
fn_test = 'test.csv'
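
As a minimal sketch of the loading step, assuming the Kaggle files have been downloaded into the folder named above: the helper name `load_house_prices` and the dataframe names `df` and `df_test` are illustrative choices, not something fixed by the dataset.

```python
from pathlib import Path

import pandas as pd

# Folder and file names as defined in the cell above.
folder = 'house-prices-advanced-regression-techniques/'
fn_train = 'train.csv'
fn_test = 'test.csv'


def load_house_prices(folder, fn_train, fn_test):
    """Read the training and test CSV files into two separate dataframes.

    Hypothetical helper for illustration only.
    """
    df_train = pd.read_csv(Path(folder) / fn_train)  # includes the 'SalePrice' target
    df_test = pd.read_csv(Path(folder) / fn_test)    # has no 'SalePrice' column
    return df_train, df_test


# Usage, once the files are in place:
# df, df_test = load_house_prices(folder, fn_train, fn_test)
```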

With the data loaded, we now want to do some exploratory data analysis, also called EDA.

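First we use the dataframe.describe() method to get a general overview of the summary statistics for each numerical feature. The dataframe.info() method complements it by listing each column's data type and non-null count.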

In [None]:
# use df.describe() and df.info()

Next we want to take a closer look at the target and at some of the features we think
are the most interesting. Which features do we expect to be most important for predicting the sale price?

One option is a pair plot of all the features. I won't do that right now because when I tried
it earlier it took too long to run, so for this demonstration I will plot histograms, which show the distribution,
of some of the features of interest.

In [None]:
# # here is how we would plot the pairplot. notice how we take just the numerical features, not the categorical ones
# numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns

# # create the pairplot using sns.pairplot with the subset of numerical columns
# import seaborn as sns
# sns.pairplot(df[numerical_cols])

# a general rule of thumb for the number of bins is the square root of the number of data points

# plot histograms here
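
The square-root rule of thumb can be sketched as follows. The data here is synthetic (a skewed, price-like column generated at random), so the numbers are an assumption for illustration and not the real SalePrice values.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for a skewed, price-like column (not the real data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"SalePrice": rng.lognormal(mean=12.0, sigma=0.4, size=400)})

# Square-root rule of thumb: number of bins ~ sqrt(number of data points).
n_bins = int(np.sqrt(len(toy)))

fig, ax = plt.subplots(dpi=100)
counts, bin_edges, _ = ax.hist(toy["SalePrice"], bins=n_bins)
ax.set_title("Distribution of SalePrice (synthetic)")
ax.set_xlabel("SalePrice")
ax.set_ylabel("Count")
plt.show()
```

With the real data, the same pattern would apply to any numerical column of the training dataframe.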

Now let's get an idea of how many NaN values we have in the dataframe. We will need to deal with these.

Here are some ideas to keep in mind when dealing with NaN values in your data:

- What percentage of the data are NaNs? If it's small, say around 4%, we can often just drop those rows. If it's larger, say 20%, we need to do something else. For example we could fill with zero, but we need to think about whether that makes sense for the feature. We could also take the average of the surrounding points.

- This act of filling in NaNs is called **imputation** and we will discuss this.

- NaNs generally need to be filled or dropped before modeling. Most models cannot handle them! It's like putting $\infty$ into a math equation.

- Some models will warn you or raise an error if there are NaNs.

- How many **ROWS** contain NaNs is a good question to ask.

In [None]:
# print each feature along with its number of NaN values

# calculate nan statistics here
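
The questions in the list above (how many NaNs per feature, what percentage, and how many rows are affected) can be answered with a few pandas calls. This is a sketch on a small made-up dataframe; the column values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy dataframe with missing values (made up for illustration).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan, 84.0],
    "GarageType": pd.Series(["Attchd", np.nan, "Detchd", "Attchd", "Attchd"], dtype=object),
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

nan_counts = toy.isna().sum()                  # number of NaNs per feature
nan_percent = 100 * toy.isna().mean()          # share of NaNs per feature, in percent
rows_with_nan = toy.isna().any(axis=1).sum()   # rows containing at least one NaN

for col in toy.columns:
    print(f"{col}: {nan_counts[col]} NaNs ({nan_percent[col]:.0f}%)")
print(f"Rows with at least one NaN: {rows_with_nan} of {len(toy)}")
```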

We can see that there are NaN values. We will first define $X$ and $y$ along with splitting the data, and then do the preprocessing.

Preprocessing involves:

- imputing the data (filling in the NaN values)
- scaling the numerical data so that all of the features are in the same range
- encoding the categorical variables

In this step we split the data from 'train.csv' into training and validation subsets.

The training data will be used to train the model. The validation data will be used to estimate the performance of the model we will set up. In the case of this data, we have the "known target values" for the training (and therefore the validation) data, but the test data has no "known value" (you can check 'test.csv', there is no 'SalePrice' column).

We will use an 80/20 split: 80% of the rows for training and 20% for validation.

In [None]:
# split for training and validation
from sklearn.model_selection import train_test_split
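
A minimal sketch of defining $X$ and $y$ and making the 80/20 split, on a made-up dataframe. The names `X_train` and `y_train`, and the choice of `random_state=42`, are assumptions for illustration; `y_val` matches the name used in the plotting cells later on.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the training dataframe (values are made up).
toy = pd.DataFrame({
    "GrLivArea": np.arange(10) * 100 + 800,
    "OverallQual": [5, 6, 7, 5, 8, 6, 7, 9, 4, 6],
    "SalePrice": np.arange(10) * 10_000 + 100_000,
})

X = toy.drop(columns=["SalePrice"])  # features: everything except the target
y = toy["SalePrice"]                 # target

# test_size=0.2 holds out 20% of the rows for validation;
# random_state makes the split reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_val.shape)
```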

Now with $X$ and $y$ defined we can carry out the preprocessing.

Sources:

SimpleImputer: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

OneHotEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

One hot encoding further explained: https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/

First we get the numeric features as a list, and the categorical features as a list.

Next we use the "numeric_transformer" to set how we want to deal with missing values, as well as scaling the features to be similar. The "categorical_transformer" does the same kind of thing for the categorical variables, which are described by strings as opposed to numbers: it fills in missing values and then encodes the strings as numbers.

Last we define the preprocessor, where we apply the numerical transformer to the numerical data and we apply the categorical transformer to the categorical data.

This preprocessing will then be combined with the regression model later.

In [None]:
# PREPROCESSING STEPS FOR NUMERICAL AND CATEGORICAL FEATURES

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# this gets the numeric features as a list, these will either be an integer or a float
# numeric_features =

# similarly this gets the categorical features as a list. these are strings and not numbers, for example could be 'yes' or 'no'
# categorical_features =

# this defines how we want to transform the numerical features
# numeric_transformer =


# this defines how we want to transform the categorical features
# categorical_transformer =


# this now applies the column transforms we just defined
# preprocessor =
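
One way the three preprocessing pieces could fit together is sketched below on a tiny made-up dataframe. The imputation strategies (median for numbers, most frequent value for strings) and `handle_unknown="ignore"` are assumptions chosen for illustration, not the only reasonable settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy features with missing values (made up for illustration).
X_toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "GrLivArea": [1710, 1262, 1786, 1717],
    "GarageType": pd.Series(["Attchd", np.nan, "Detchd", "Attchd"], dtype=object),
})

# Column names by type: numbers vs. everything else (strings here).
numeric_features = X_toy.select_dtypes(include=["number"]).columns.tolist()
categorical_features = X_toy.select_dtypes(exclude=["number"]).columns.tolist()

# Numeric columns: fill NaNs with the median, then standardize.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical columns: fill NaNs with the most frequent value, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Apply each transformer to its own set of columns.
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

X_prepared = preprocessor.fit_transform(X_toy)
print(X_prepared.shape)  # 2 scaled numeric columns + 2 one-hot columns
```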

We are now ready to define our regression model.

For this analysis we will be using the scikit-learn RandomForestRegressor. This comes from the sklearn.ensemble module, which is a set of **ensemble** methods, meaning methods that combine multiple sub-models. In this case, the sub-models are decision trees, and the forest's prediction is the average of the individual trees' predictions.

A decision tree in ML is essentially a set of if and else statements to subset data. A random forest is a collection of decision trees where the if else statements are slightly different, because each tree is trained on a different random sample of the data, to get a more robust estimate.

It is useful here because there are many features which affect the final sale price.

The main hyperparameter we will look at in the sklearn RandomForestRegressor is called n_estimators, which is the number of decision trees. A higher number averages over more trees and can give you a more stable, more accurate result, but will take longer to compute, and the improvement usually levels off. You still want to watch out for overfitting, which is why we check performance on the validation data.

RandomForestRegressor docs: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

In [None]:
# define random forest regressor model
from sklearn.ensemble import RandomForestRegressor

number_of_trees = 100 # set the number of trees in the forest
# model =
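
To make the "collection of decision trees" idea concrete, here is a small sketch on synthetic numeric data (the data and `random_state` are assumptions for illustration). It checks that the forest contains `n_estimators` trees and that the forest's prediction is the mean of the individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (made up for illustration).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] + np.sin(X_toy[:, 1]) + rng.normal(0, 0.1, size=200)

number_of_trees = 100  # set the number of trees in the forest
model = RandomForestRegressor(n_estimators=number_of_trees, random_state=42)
model.fit(X_toy, y_toy)

# The fitted forest stores one decision tree per estimator.
print(len(model.estimators_))

# The forest prediction is the average over the individual trees.
forest_pred = model.predict(X_toy[:5])
tree_preds = np.array([tree.predict(X_toy[:5]) for tree in model.estimators_])
print(np.allclose(forest_pred, tree_preds.mean(axis=0)))
```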

With the Random Forest Regression model defined, we now combine it with the preprocessing that we defined.

Together this forms the 'full_model'. This model will then be trained (also known as 'fit'), and then used to make predictions.

In [None]:
# define full model

# full_model =
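
A sketch of how the preprocessing and the regressor might be chained into one pipeline, on synthetic data. The step name "regressor" matches the `regressor__n_estimators` key printed in the tuning cell further down; the step name "preprocessor", the toy columns, and the imputation strategies are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: one numeric and one categorical feature (made up).
rng = np.random.default_rng(0)
n = 100
area = rng.uniform(800, 3000, size=n)
garage = rng.choice(["Attchd", "Detchd"], size=n).astype(object)
y_toy = 100 * area + 20_000 * (garage == "Attchd") + rng.normal(0, 5_000, size=n)

X_toy = pd.DataFrame({
    "GrLivArea": area,
    "GarageType": pd.Series(garage, dtype=object),
})
X_toy.loc[::10, "GrLivArea"] = np.nan  # inject some missing values

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["GrLivArea"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["GarageType"]),
])

# Preprocessing and model in a single object: fit() and predict()
# run the preprocessing automatically before the regressor.
full_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(n_estimators=100, random_state=42)),
])

full_model.fit(X_toy, y_toy)
print(full_model.predict(X_toy.head(3)))
```

Keeping the preprocessing inside the pipeline means the imputation and scaling statistics are learned only from the data passed to `fit`, which avoids leaking information from the validation rows.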

With the full model defined, it can now be fit using the **TRAINING DATA**.

This is where the model attempts to learn the underlying patterns.

With many features (dimensions), a random forest is a useful choice because each tree can split the data on different features and thresholds, without us having to specify in advance how the features interact.

In [None]:
# fit model to data

'''
do not need to convert the dataframes to np arrays anymore, FOR THE MOST PART. this is an older practice
check the docs for which ever algorithm you are using
'''


None # I just put this here to suppress the output of the fit. if you want you can delete this line and see what happens

With our model defined and trained, we can now make predictions on the **VALIDATION DATA**.

The terminology might be slightly confusing compared to the ML2 session where we made predictions on the test data.

Terminology in ML can be loose in general. The concepts are more important than the words.

In this specific case, we have the known values of the validation data. Meaning we have the house price of these rows.

We will "predict" the sale price, and then compare these predicted values to the known values in order to get an estimation of the model's performance.

We can evaluate the model performance by calculating the root mean squared error (RMSE) and R squared ($R^2$).

RMSE quantifies the typical size of the difference between the values predicted by your model and the actual values in your data, in the same units as the target. The lower the RMSE, the better the model fits the data, indicating smaller differences between predictions and actual values.

R squared is a statistical measure that represents the proportion of variance (spread) in the target variable that your model explains. A higher R squared value (closer to 1) indicates a better fit, meaning your model explains a larger proportion of the variance in the target variable. An R squared of 0 means the model explains none of the variance, essentially the same as predicting the average value for all data points.


In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# make predictions on VALIDATION DATA
# y_val_pred =

# evaluate model performance

# calculate mean squared error, this tells us the average squared difference between the predicted and true values. so let's take the square root
# val_rmse =

# r^2 tells us how well the model explains the variance in the data. an r^2 of 1 is a perfect fit and an r^2 of 0 is no better than always predicting the mean
# val_r2 =

# print out the metrics we just calculated
print(f"Validation RMSE: {val_rmse}") # RMSE is in the same units as SalePrice, so this number is large
print(f"Validation R2 Score: {val_r2}")
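
To show what the two metrics compute, here is a small worked sketch with made-up numbers, comparing a by-hand calculation against the scikit-learn functions. The arrays are invented for illustration. The last line expresses the RMSE as a fraction of the mean target, which is one way to judge whether an RMSE is "large".

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

# RMSE: square root of the mean squared difference.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

# R^2: 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print(f"RMSE: {rmse_manual:.3f} (manual), {rmse_sklearn:.3f} (sklearn)")
print(f"R2:   {r2_manual:.3f} (manual), {r2_sklearn:.3f} (sklearn)")
print(f"RMSE / mean target: {rmse_manual / y_true.mean():.1%}")
```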

The $R^2$ we calculated looks good but the RMSE seems high. Let's compare it to the mean of the training target to get an idea of the RMSE as a percentage of the mean value.

In [None]:
# print ratio of RMSE to mean training sale price

# so the error is like 15%

We now have an idea of the model's performance from the root mean squared error and $R^2$ (R squared) metrics.

Let's now make a scatter plot comparing the "predicted" values to the actual values.

Keep in mind this plot is telling us how well we made predictions on the **TRAINING DATA**. So it should follow very close to a straight line with slope = 1.

In [None]:
# plot actual versus predicted for training set

# this is how well the model fits the training set it was trained on

# make prediction on training price data and plot it against the actual price data

Now let's see how well the model generalizes to the "unknown" data which is in the **VALIDATION SET**.

In this specific instance what I mean by unknown is that the model did not see this data during the training step.

In [None]:
# plot actual vs predicted for validation set

# this shows how the model performs on data it wasn't trained on. gives idea of how well it generalizes to new data

plt.figure(dpi = 100)
plt.scatter(y_val, y_val_pred, alpha=0.5)
plt.plot([min(y_val), max(y_val)], [min(y_val), max(y_val)], '--', color='red')
plt.title("Actual vs Predicted (Validation)")
plt.xlabel("Actual SalePrice")
plt.ylabel("Predicted SalePrice")
plt.show()

The model is working ok, but could it be improved?

Let's use hyperparameter tuning to find the best number of trees to use in the forest. This will affect how many trees are averaged to produce each prediction.

In [None]:
# HYPERPARAMETER TUNING

from sklearn.model_selection import GridSearchCV

# define hyperparameters and their range
# param_grid =

# create gridsearchcv object
# grid_search =


# now use the grid search to fit the training data to find the optimal hyperparameters (in this case just one)

# get the best parameters (the n_estimators which resulted in the best r^2 value)
# best_params =

# print the best parameter
print(best_params)

# {'regressor__n_estimators': 150}
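
A sketch of a grid search over the number of trees, on synthetic data. The candidate values, `cv=3`, and the simplified pipeline (a scaler plus the regressor) are assumptions for illustration. The `regressor__` prefix in the parameter name refers to the pipeline step named "regressor".

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (made up for illustration).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(120, 3))
y_toy = 3.0 * X_toy[:, 0] + X_toy[:, 1] ** 2 + rng.normal(0, 1.0, size=120)

full_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", RandomForestRegressor(random_state=42)),
])

# "<step name>__<parameter name>" addresses a parameter inside the pipeline.
param_grid = {"regressor__n_estimators": [10, 50, 100]}

# 3-fold cross-validation on the training data, scored by R^2.
grid_search = GridSearchCV(full_model, param_grid, cv=3, scoring="r2")
grid_search.fit(X_toy, y_toy)

best_params = grid_search.best_params_
print(best_params)
```

Because the search uses cross-validation within the data it is given, the validation set can stay untouched until the final comparison.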

With our optimized value for n_estimators we can now refit the model and see if the performance changes.

In [None]:
# define random forest regressor model
from sklearn.ensemble import RandomForestRegressor

# number_of_trees =
# model_optimized =
# define full optimized model
# full_model_optimized =

# fit model to data
# full_model_optimized.

# make predictions on validation data with optimized model
# y_val_pred =

# calculate metrics on optimized model
# use RMSE here as well so the numbers are directly comparable to the first model
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
val_r2 = r2_score(y_val, y_val_pred)
print(f"Validation RMSE: {val_rmse}")
print(f"Validation R2 Score: {val_r2}")

Our metrics don't change too much... let's plot the new predictions on the validation data and examine.

In [None]:
plt.figure(dpi = 100)
plt.scatter(y_val, y_val_pred, alpha=0.5)
plt.plot([min(y_val), max(y_val)], [min(y_val), max(y_val)], '--', color='red')
plt.title("Actual vs Predicted (Validation)")
plt.xlabel("Actual SalePrice")
plt.ylabel("Predicted SalePrice")
plt.show()