## CALIFORNIA HOUSING PROJECT

# Part 2: ***Building the model***

We will start with a simple **Linear Regression model** and then experiment with more complex models. The most important step is to establish an effective baseline, and linear regression is a natural place to begin: we are predicting housing prices from previous data, and linear regression is a simple, interpretable model that assumes a linear relationship between the features and the target variable (a correlation coefficient close to +1 or -1 between a feature and the target indicates a strong linear relationship).

Once we have a baseline model, we can explore other algorithms like Decision Trees, Random Forests, Gradient Boosting Machines, or even Neural Networks.
Furthermore, we could even consider hyperparameter tuning.

We now build our model.

In [19]:
# Importing the necessary libraries for analytics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from tensorflow import keras
from sklearn.preprocessing import StandardScaler

In [2]:
# Importing data for model
data = pd.read_csv('housing_data_with_new_features.csv')
data.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN,BedroomsPerRoom,RoomsPerHousehold
0,-122.23,37.88,41.0,6.781058,4.867534,5.777652,4.844187,8.3252,452600.0,False,False,False,True,False,0.717813,1.399834
1,-122.22,37.86,21.0,8.86785,7.009409,7.784057,7.037906,8.3014,358500.0,False,False,False,True,False,0.790429,1.260013
2,-122.24,37.85,52.0,7.291656,5.252273,6.20859,5.181784,7.2574,352100.0,False,False,False,True,False,0.720313,1.407171
3,-122.25,37.85,52.0,7.150701,5.463832,6.326149,5.393628,5.6431,341300.0,False,False,False,True,False,0.764097,1.325768
4,-122.25,37.85,52.0,7.395108,5.638355,6.338594,5.560682,3.8462,342200.0,False,False,False,True,False,0.762444,1.329892


So, we begin with a linear regression model.
**Simple linear regression** is a statistical method used to understand the relationship between two variables: a dependent variable (target) and an independent variable (predictor or feature).
The goal is to model the relationship between these two variables by fitting a linear equation to the observed data. This relationship is represented by the equation of a straight line: **y = β0 + β1 × x**, where y is the target, x is the feature, β0 is the **intercept** (the value of y when x is 0), and β1 is the **slope** (the change in y for a one-unit change in x).

For more than one feature (independent variable), we can use the analogous method called **Multiple Linear Regression**. It is the same statistical method adapted to multiple features, and we can choose how many (and which) features to include. It can be represented by the expression **y = β0 + β1 × x1 + β2 × x2 + β3 × x3 + ⋯ + βn × xn**, for n features. The coefficients β1, …, βn are known as the weights (each one is the weight of its corresponding feature x), and β0 is the intercept.
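As a minimal sketch of the expression above, the prediction for one sample is the intercept plus a weighted sum of its feature values. The weights and the sample below are hypothetical numbers chosen for illustration; they are not fitted to the housing data.

```python
# Hypothetical intercept and weights, for illustration only (not fitted to the housing data)
beta_0 = 50_000.0               # intercept
betas = [40_000.0, -1_000.0]    # one weight per feature


def predict(features, intercept, weights):
    """Return intercept + sum of weight_i * feature_i for a single sample."""
    return intercept + sum(w * x for w, x in zip(weights, features))


sample = [8.0, 41.0]            # two hypothetical feature values: x1, x2
y_hat = predict(sample, beta_0, betas)
print(y_hat)                    # 329000.0
```

Fitting a linear regression amounts to choosing the intercept and weights that make such predictions as close as possible to the observed targets.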

This is the algorithm we will now use.

Our algorithm needs to know which column of our dataset is the target (y) and which columns are the features (X). If we want to use all columns except the target as features, we can simply drop the target to define X. Otherwise, we need to select the features explicitly.

In [3]:
# Defining, from our DataFrame, the target (y) and the multiple features (X)

X = data.drop('median_house_value', axis = 1)  # Multiple features (if we want all)
#X = data[['total_rooms', 'median_income']]  # Multiple features (if we want to choose)
y = data['median_house_value']  # Target

To properly evaluate the performance of a linear regression model, it's important to split the data into training and testing sets.

In [4]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, # feature(s)
                                                    y, # target
                                                    test_size = 0.2, # percentage of the data used for testing (so, 80% for training)
                                                    random_state = 42)  # ensures reproducibility

So, 20% of the data is for testing and 80% is for training. And we have our features and target defined.
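To make the roles of `test_size` and `random_state` concrete, here is a plain-Python sketch of a shuffled split over row indices. It is not scikit-learn's implementation (details such as rounding may differ), and `simple_split` is a hypothetical helper introduced only for this illustration.

```python
import random


def simple_split(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # same seed -> same shuffle every run
    n_test = round(n_samples * test_size)     # number of held-out rows
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = simple_split(10)
print(len(train_idx), len(test_idx))          # 8 2
```

Because the shuffle is seeded, calling the function again with the same arguments returns the same partition, which is what makes results comparable between runs.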

# Linear Regression

We are ready to create our regressor (regression model). For this we may use the **LinearRegression()** imported from ***sklearn.linear_model***. Then we can fit the model to our data (train data) and predict on the test data.


In [5]:
# Creating the regressor
RegModel = LinearRegression()

# Fitting the model to the training data and predicting on the testing data
RegModel.fit(X_train, y_train)
y_pred = RegModel.predict(X_test)

# Displaying prediction
predictions_data = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
display(predictions_data)

Unnamed: 0,Actual,Predicted
14278,245800.0,230764.772614
16224,137900.0,149941.727541
7646,218200.0,208835.388274
1402,220800.0,171317.279827
1328,170500.0,225062.311683
...,...,...
8204,500001.0,293204.084805
6206,157900.0,168169.318737
2974,100200.0,115894.630069
13314,127700.0,132892.843995


Evaluating the performance of a regression model typically involves calculating several key metrics that indicate how well the model is predicting the target variable.
There are four common evaluation metrics for regression models: **Mean Absolute Error** (MAE), the average of the absolute errors between the predicted and actual values; **Mean Squared Error** (MSE), the average of the squared errors between the predicted and actual values; **Root Mean Squared Error** (RMSE), the square root of the MSE, which brings the error metric back to the original scale of the target variable; and **R-Squared** (R²), the proportion of the variance in the dependent variable that is predictable from the independent variables. R² equals 1 for perfect prediction and 0 for a model that does no better than always predicting the mean; it can be negative when the model does worse than that.

We will compute all four to evaluate this model.
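The four definitions can be checked by hand on a tiny example. The sketch below uses made-up numbers (not the housing data) and plain Python, so every step of the arithmetic is visible; `regression_metrics` is a hypothetical helper, not a library function.

```python
import math

# Made-up values for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 8.0, 11.0]


def regression_metrics(actual, predicted):
    """Return (MAE, MSE, RMSE, R^2) computed directly from their definitions."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    return mae, mse, rmse, 1.0 - ss_res / ss_tot


mae_toy, mse_toy, rmse_toy, r2_toy = regression_metrics(y_true, y_hat)
print(mae_toy, mse_toy, round(rmse_toy, 4), round(r2_toy, 4))   # 1.0 1.5 1.2247 0.7

# Predictions that are worse than always predicting the mean give a negative R^2
_, _, _, r2_bad = regression_metrics(y_true, [9.0, 7.0, 5.0, 3.0])
print(r2_bad)                                                    # -3.0
```

RMSE is expressed in the same units as the target, which is why it is usually easier to interpret than MSE.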

In [6]:
# Calculating MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")

# Calculating MSE
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

# Calculating RMSE
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse}")

# Calculating R-squared
r2 = r2_score(y_test, y_pred)
print(f"R-squared (R²): {r2}")

Mean Absolute Error (MAE): 48660.75657971959
Mean Squared Error (MSE): 4530030653.760366
Root Mean Squared Error (RMSE): 67305.5024032981
R-squared (R²): 0.6687407117584971


This model has a reasonable level of accuracy, with an R² value above 0.6, which indicates that it explains a substantial portion of the variance in housing prices (though around 33% of the variance remains unexplained by the model). However, there is room for improvement, as the MAE and RMSE show that the typical prediction errors are relatively large.

We have already done the feature engineering we needed and handled the outliers. Thus, there are two main routes to better evaluation scores: **Model Complexity**, exploring more complex models such as polynomial regression, decision trees, or ensemble methods that might capture non-linear relationships better; and **Hyperparameter Tuning**, adjusting the settings of the model to optimize its performance.
However, hyperparameter tuning tends to be more impactful for models that are already more complex (like Random Forest or Gradient Boosting). Hence, we will explore more complex models first.

# Random Forest

Let us create a Random Forest Regressor now, and compare the results.

In [7]:
# Creating the regressor
RFModel = RandomForestRegressor()

# Fitting the model to the training data and predicting on the testing data
RFModel.fit(X_train, y_train)
y_pred = RFModel.predict(X_test)

# Displaying prediction
predictions_data = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
display(predictions_data)

Unnamed: 0,Actual,Predicted
14278,245800.0,250101.02
16224,137900.0,152251.00
7646,218200.0,196677.00
1402,220800.0,133322.00
1328,170500.0,157226.99
...,...,...
8204,500001.0,365192.16
6206,157900.0,159597.00
2974,100200.0,83447.00
13314,127700.0,110564.00


In [8]:
# Evaluating the R2 score of the RF model

RFModel.score(X_test, y_test)

0.8194823737656313

This R² value is closer to 1, which indicates that the model is making more accurate predictions. The model captures more than 80% of the variance in housing prices, which is quite good and a significant improvement over the linear regressor (about 15 percentage points more).
Nevertheless, **hyperparameter tuning** can often improve the performance of a Random Forest model.
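The improvement over the baseline can be read either as an absolute difference in R² or as a change relative to the baseline, and the two readings give different numbers. A quick check using the two scores reported above (the variable names are chosen for this sketch):

```python
r2_linear = 0.6687407117584971   # linear regression R² reported above
r2_forest = 0.8194823737656313   # random forest R² reported above

absolute_gain = r2_forest - r2_linear        # difference in R², in absolute terms
relative_gain = absolute_gain / r2_linear    # improvement relative to the baseline

print(f"Absolute gain: {absolute_gain * 100:.1f} percentage points")
print(f"Relative gain: {relative_gain * 100:.1f}%")
```

The absolute gain is roughly 15 percentage points of explained variance, while the relative gain over the linear baseline is larger.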

Hyperparameter tuning is the process of finding the optimal set of hyperparameters for a machine learning model. Hyperparameters are the settings or configurations external to the model that are not learned from the data but set before training (like the learning rate, or the number of trees in a random forest). These differ from model parameters, which are learned during training (like the weights in linear regression).

Common Hyperparameters:

*   **Learning rate**: Controls the size of the steps the model takes while minimizing the loss function;
*   **Number of trees** (in ensemble methods like Random Forests): Controls the number of decision trees in the model;
*   **Max depth** (in decision trees): Controls how deep the decision tree can grow;
*   **Regularization parameters** (in linear models): Helps prevent overfitting by adding a penalty for large coefficients.
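The distinction between a hyperparameter and a learned parameter can be sketched with the smallest possible regularized model: ridge regression with one feature and no intercept, for which the weight has the closed form w = Σxy / (Σx² + α). Here `alpha` is the hyperparameter (chosen before fitting) and the returned weight is the learned parameter. The data and the helper `fit_ridge_weight` are made up for this illustration.

```python
# Made-up data generated from y = 2x, for illustration only
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]


def fit_ridge_weight(xs, ys, alpha):
    """Closed-form ridge weight for a single feature and no intercept."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + alpha)   # a larger alpha shrinks the weight toward 0


w_plain = fit_ridge_weight(xs, ys, alpha=0.0)    # no penalty
w_reg = fit_ridge_weight(xs, ys, alpha=14.0)     # strong penalty
print(w_plain, w_reg)                            # 2.0 1.0
```

The data alone determine the weight once `alpha` is fixed, but nothing in the fitting step chooses `alpha` itself; that choice is what hyperparameter tuning addresses.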


Methods for Hyperparameter Tuning:

*   **Grid Search**: An exhaustive search over a specified range of hyperparameters. It tries every combination and picks the best one based on a scoring metric;
*   **Random Search**: Instead of trying every combination, it randomly samples combinations from the hyperparameter space;
*   **Bayesian Optimization**: A more advanced method that models the hyperparameter space probabilistically and chooses the next set of hyperparameters based on past results;
*   **Manual Search**: Adjusting hyperparameters based on intuition and experience.
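The difference between the first two methods can be sketched in plain Python. The grid below mirrors the two Random Forest hyperparameters tuned later in this notebook, but `toy_score` is a made-up stand-in for a cross-validated score and does not evaluate any real model.

```python
import itertools
import random

param_grid = {"n_estimators": [30, 50, 100], "max_features": [8, 12, 20]}


def toy_score(params):
    """Made-up stand-in for a validation score; NOT a real model evaluation."""
    return -abs(params["n_estimators"] - 50) - abs(params["max_features"] - 12)


# Grid search: evaluate every combination in the grid
names = list(param_grid)
all_combos = [dict(zip(names, values))
              for values in itertools.product(*param_grid.values())]
best_grid = max(all_combos, key=toy_score)

# Random search: evaluate only a random subset of the combinations
rng = random.Random(0)
sampled = rng.sample(all_combos, k=4)
best_random = max(sampled, key=toy_score)

print(len(all_combos), best_grid)   # 9 {'n_estimators': 50, 'max_features': 12}
```

Grid search is guaranteed to find the best point in the grid at the cost of evaluating all of them; random search evaluates fewer candidates and may or may not hit the best one.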

Often, hyperparameter tuning is combined with **cross-validation**, where the model is trained and validated multiple times on different subsets of the data. This helps ensure the selected hyperparameters perform well across different parts of the dataset.
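As a rough sketch of how k-fold cross-validation partitions the data, the helper below splits row indices into k consecutive folds and uses each fold once as the validation set. `k_fold_indices` is a hypothetical helper written for this illustration; library implementations offer more options (such as shuffling).

```python
def k_fold_indices(n_samples, k=5):
    """Return a list of (train_indices, validation_indices) pairs, one per fold."""
    # Distribute any remainder over the first folds so that sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, validation))
        start = stop
    return folds


folds = k_fold_indices(10, k=5)
for train, validation in folds:
    print(validation, "held out;", len(train), "rows used for training")
```

Every row is used for validation exactly once, and the scores from the k runs are typically averaged to compare hyperparameter settings.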

Let us use **Grid Search**, which helps us find the best combination of hyperparameters by exhaustively searching over a specified parameter grid.


In [9]:
# Defining the model (creating a new regressor)
rfr = RandomForestRegressor()

# Defining the hyperparameters and their respective ranges
param_grid = {
    'n_estimators': [30, 50, 100],
    'max_features': [8, 12, 20]
}

# Initializing GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator = rfr, param_grid = param_grid, cv = 5, scoring = 'r2', n_jobs = -1)

# Fitting GridSearchCV on the training data
grid_search.fit(X_train, y_train)



In [10]:
# Finding the best estimator

grid_search.best_estimator_

Thus, from the lists of possible options we gave, the combination: max_features = 8, n_estimators = 30, makes the best RF model for our data. Let us use it for prediction.

In [11]:
# Using the best model from Grid Search to predict
best_rfr = grid_search.best_estimator_
y_pred_GS = best_rfr.predict(X_test)

# Displaying prediction
predictions_data_GS = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_GS})
display(predictions_data_GS)

Unnamed: 0,Actual,Predicted
14278,245800.0,238782.0
16224,137900.0,156743.0
7646,218200.0,194577.0
1402,220800.0,140674.0
1328,170500.0,156693.0
...,...,...
8204,500001.0,349455.1
6206,157900.0,161819.0
2974,100200.0,83237.0
13314,127700.0,109748.0


In [12]:
# Evaluating the best model on the test set

r2_tuned = best_rfr.score(X_test, y_test)  # named r2_tuned so the imported r2_score function is not overwritten
print(f"R-squared (R²) on test data with tuned model: {r2_tuned}")

R-squared (R²) on test data with tuned model: 0.8215651901503388


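The R² score (≈0.822) is nearly identical to the one obtained before the Grid Search (≈0.819), which indicates that tuning over this small grid brought no meaningful improvement. Since neither forest was created with a fixed random seed, a difference of this size may simply reflect run-to-run variation.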

# Neural Networks

Housing prices are influenced by a complex interplay of factors, including location, number of rooms, neighborhood characteristics, etc. These relationships might not be purely linear. **Neural networks** excel at capturing non-linear relationships between features and the target variable, which traditional linear models might miss. Unlike traditional models where feature engineering plays a crucial role, neural networks can automatically learn the best representation of features during training. This reduces the need for extensive manual feature engineering.
Given the complexity of predicting housing prices and the potential benefits of neural networks, it is worth trying a neural network model to see if it improves our results. However, neural networks come with their own challenges, such as the need for more data, longer training times, and careful tuning of hyperparameters.

Neural networks perform better when the data is scaled. We should then standardize the features.
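Standardizing means subtracting the mean and dividing by the standard deviation, with both statistics computed on the training data only and then reused for the test data. A minimal single-feature sketch on made-up numbers follows; it uses the population standard deviation, which is also what `StandardScaler` uses.

```python
import statistics

# Made-up values for one feature, for illustration only
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0]

mu = statistics.fmean(train)       # mean learned from the training data
sigma = statistics.pstdev(train)   # population standard deviation of the training data

train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]   # reuses the training statistics

print(round(test_scaled[0], 4))    # 2.1213
```

Reusing the training statistics on the test set avoids leaking information from the test data into the preprocessing step.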

In [13]:
# Standardizing the features

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

We may start with a simple sequential model, using 'Dense' layers with ReLU activation functions. The final layer should have one neuron (since it’s a regression problem) without any activation function.
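To illustrate what a dense layer with ReLU computes, and why the output neuron has no activation, here is a plain-Python forward pass through a tiny network (two inputs, two hidden neurons, one output). All weights and inputs are hypothetical values picked for easy arithmetic; a real network learns its weights during training.

```python
def relu(values):
    """ReLU keeps positive values and replaces negative ones with 0."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Each output neuron computes a weighted sum of the inputs plus its bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


# Hypothetical weights (one row per neuron), for illustration only
x = [1.0, 2.0]
W1, b1 = [[1.0, 0.5], [-1.0, -1.0]], [0.0, 1.0]
W2, b2 = [[3.0, 4.0]], [-1.0]

hidden = relu(dense(x, W1, b1))   # hidden layer with ReLU activation
output = dense(hidden, W2, b2)    # output layer: no activation, any real value is possible
print(hidden, output)             # [2.0, 0.0] [5.0]
```

Leaving the last layer linear lets the network output values on the scale of the target rather than restricting them to a fixed range.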

In [14]:
# Defining the Neural Network model

NN_Model = keras.Sequential([
                         keras.layers.Dense(64, activation = 'relu', input_shape = (X_train_scaled.shape[1],)), # First hidden layer (input_shape sets the number of input features)
                         keras.layers.Dense(64, activation = 'relu'), # Second hidden layer
                         keras.layers.Dense(1)  # Output layer for regression
                        ])



**Optimizers** are crucial in training neural networks because they determine how the model's weights are updated during training. The goal of training is to minimize the loss function, which measures the difference between the predicted and actual values. Optimizers guide the model in finding the optimal set of weights that result in the lowest possible loss.

So, now we compile the model with an optimizer. We can use the Mean Squared Error (MSE) as the loss function, a common choice for regression problems where you want to minimize the difference between the predicted and actual values.
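The idea of an optimizer updating weights to reduce the MSE can be sketched with plain gradient descent on a one-feature linear model. This is deliberately simpler than Adam, which adapts the step size for each weight, and the data and learning rate below are made up for illustration.

```python
# Made-up data generated from y = 2x + 1, for illustration only
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def mse_loss(w, b):
    """Mean squared error of the predictions w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


w, b = 0.0, 0.0          # initial weights
learning_rate = 0.05     # step size (a hyperparameter)
initial_loss = mse_loss(w, b)

for _ in range(2000):
    n = len(xs)
    # Gradients of the MSE with respect to w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Move each weight a small step against its gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mse_loss(w, b)
print(round(w, 4), round(b, 4))   # 2.0 1.0
```

Each pass moves the weights in the direction that lowers the loss, which is the same principle a neural network optimizer applies to many more weights at once.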

In [15]:
# Compiling the model with the Adam optimizer, MSE as the loss, and MAE as an additional metric
NN_Model.compile(optimizer = 'adam', loss = 'mse', metrics = ['mae'])

Now we fit the model to our training data and evaluate it on the test data.

In [16]:
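# Caution: the unscaled X_train is passed here, not X_train_scaled from the standardization step;
# the training log and metrics shown below come from this unscaled run
NN_Model.fit(X_train, y_train, epochs = 50, batch_size = 32, validation_split = 0.2)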

loss, mae = NN_Model.evaluate(X_test, y_test)
print(f"Test MAE: {mae}")  # Mean Absolute Error

Epoch 1/50
409/409 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 54033514496.0000 - mae: 201816.0625 - val_loss: 28880404480.0000 - val_mae: 128782.9922
Epoch 2/50
409/409 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 19833501696.0000 - mae: 103143.8984 - val_loss: 12966811648.0000 - val_mae: 89321.8672
Epoch 3/50
409/409 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 12738825216.0000 - mae: 88730.8594 - val_loss: 12863898624.0000 - val_mae: 88746.8984
Epoch 4/50
409/409 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 12319416320.0000 - mae: 87443.9297 - val_loss: 12734439424.0000 - val_mae: 88709.8125
Epoch 5/50
409/409 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 12545631232.0000 - mae: 88319.7734 - val_loss: 12614011904.0000 - val_mae: 88245.1406
Epoch 6/50
409/409 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 122

In [20]:
y_pred_nn = NN_Model.predict(X_test)
# y_pred_nn = y_pred_nn.flatten()  # Flatten if the output is in a 2D array

# Calculating metrics for Neural Network
mae_nn = mean_absolute_error(y_test, y_pred_nn)
mse_nn = mean_squared_error(y_test, y_pred_nn)
rmse_nn = mse_nn**0.5
r2_nn = r2_score(y_test, y_pred_nn)

print("Neural Network Metrics:")
print("MAE:", mae_nn)
print("MSE:", mse_nn)
print("RMSE:", rmse_nn)
print("R^2:", r2_nn)

128/128 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Neural Network Metrics:
MAE: 52102.86805868837
MSE: 5310367620.447933
RMSE: 72872.26921434472
R^2: 0.6116784338335384


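The neural network does not outperform either the linear regression or the random forest model: its MAE (≈52,100) and RMSE (≈72,900) are higher, and its R² (≈0.61) is lower, than those of the linear regression (≈48,700, ≈67,300, and ≈0.67, respectively). Note that the network was fit and evaluated on the unscaled `X_train` and `X_test` rather than on the standardized arrays, which may account for part of this gap.
The random forest model provides the best performance by a clear margin in terms of R² (≈0.82), which suggests it is the better choice for this particular dataset.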

# CONCLUSION

In this project, we developed and evaluated several models to predict housing prices in California. Among the models tested, the Random Forest Regressor provided the best performance, with an R² above 0.8, outperforming both the linear regression and the neural network by effectively capturing non-linear relationships in the data. This result highlights the value of more advanced models when dealing with complex datasets like real estate. It also shows that, while deep learning can be incredibly powerful, it is not the best solution for every problem.

Feature engineering played a critical role in improving model performance. The introduction of **BedroomsPerRoom** and **RoomsPerHousehold** as new features added valuable information that the original features alone could not capture. This underscores the necessity of thoughtful feature engineering in data science projects, particularly when working with heterogeneous data.

**Practical Implications**

The model developed in this project could be a valuable tool for real estate professionals, helping them set competitive prices by providing more accurate estimates. Homebuyers and investors could also benefit from these insights, enabling them to make more informed decisions in the housing market.

**Future Work**

To further enhance the model's performance, future work could explore the use of more sophisticated models, such as **Gradient Boosting** Machines or other deep learning techniques. Additionally, incorporating more granular data, such as proximity to amenities or temporal trends, could provide further improvements. Expanding the feature set to include economic indicators or environmental factors may also yield a more comprehensive model.