
# Model Building for House Prices Prediction

## 1. Introduction
In this notebook, we build and evaluate several models to predict house prices from the cleaned and engineered dataset. We will:
1. Load the cleaned dataset from the feature engineering stage.
2. Split the dataset into training and testing sets.
3. Build and evaluate a baseline model (Linear Regression).
4. Train more advanced models (Random Forest and Gradient Boosting).
5. Tune the Gradient Boosting hyperparameters with GridSearchCV.
6. Save the trained models for future use.



## 2. Loading the Cleaned Dataset
We will load the cleaned and engineered dataset that was saved during the feature engineering phase.


In [1]:

# Load the cleaned and engineered dataset
import pandas as pd

train_cleaned = pd.read_csv('cleaned_train_data.csv')

# Display the shape and preview the dataset
print(f"Cleaned dataset shape: {train_cleaned.shape}")
train_cleaned.head()


Cleaned dataset shape: (1460, 267)


| Unnamed: 0 | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial | TotalSF | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | 65.0 | -0.207142 | 7 | 5 | 2003 | 2003 | 196.0 | 706 | ... | False | False | True | False | False | False | True | False | -0.001277 | -1.043259 |
| 1 | 2 | 20 | 80.0 | -0.091886 | 6 | 8 | 1976 | 1976 | 0.0 | 978 | ... | False | False | True | False | False | False | True | False | -0.052407 | -0.183465 |
| 2 | 3 | 60 | 68.0 | 0.07348 | 7 | 5 | 2001 | 2002 | 162.0 | 486 | ... | False | False | True | False | False | False | True | False | 0.169157 | -0.977121 |
| 3 | 4 | 70 | 60.0 | -0.096897 | 7 | 5 | 1915 | 1970 | 0.0 | 216 | ... | False | False | True | False | False | False | False | False | -0.114493 | 1.800676 |
| 4 | 5 | 60 | 84.0 | 0.375148 | 8 | 5 | 2000 | 2000 | 350.0 | 655 | ... | False | False | True | False | False | False | True | False | 0.944631 | -0.944052 |
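
The preview lists `Unnamed: 0` and `Id` as its first two columns. If these are row identifiers (an assumption; pandas names a column `Unnamed: 0` when a CSV has an unlabeled first column, which typically happens when a DataFrame index is written out and read back), they carry no information about price and a model can only fit noise on them. A minimal sketch of dropping such columns before modelling follows. The helper `drop_identifier_columns` and the toy frame are hypothetical, and the results reported in this notebook were produced without this step.

```python
import pandas as pd


def drop_identifier_columns(df, candidates=("Unnamed: 0", "Id")):
    """Return a copy of df without the identifier-like columns that are present."""
    present = [col for col in candidates if col in df.columns]
    return df.drop(columns=present)


# Toy frame standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Id": [1, 2, 3],
    "OverallQual": [7, 6, 7],
    "SalePrice": [200000, 180000, 220000],
})

features_only = drop_identifier_columns(toy)
print(list(features_only.columns))
```

Because the helper only drops columns that exist, it is safe to call on a frame that has already been cleaned.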



## 3. Splitting Data into Train and Test Sets
We hold out 20% of the rows as a test set and train on the remaining 80%, so that model performance is measured on data the model has not seen during training.


In [2]:

from sklearn.model_selection import train_test_split

# Define the target and features
X = train_cleaned.drop('SalePrice', axis=1)
y = train_cleaned['SalePrice']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of the training and testing sets
X_train.shape, X_test.shape


((1168, 266), (292, 266))
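
Two properties of the split are worth checking: no row lands in both sets, and a fixed `random_state` reproduces the same partition on every run. The sketch below illustrates both on a synthetic frame with the same number of rows as the dataset above. The names `X_demo` and `y_demo` and the random values are placeholders, not part of the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset above.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1460, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(rng.normal(size=1460), name="target")

# Split twice with the same seed to compare the resulting partitions.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# No row index appears in both sets, and the same seed gives the same test rows.
no_overlap = set(X_tr.index).isdisjoint(X_te.index)
same_split = X_te.index.equals(X_te2.index)
print(len(X_tr), len(X_te), no_overlap, same_split)
```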


## 4. Baseline Model: Linear Regression
We start with a Linear Regression model as a baseline. Its test RMSE is the benchmark that the later models need to beat.


In [3]:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Train a Linear Regression model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Make predictions on the test set
y_pred_lr = lr.predict(X_test)

# Calculate RMSE for the baseline model
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
print(f'Baseline Model RMSE (Linear Regression): {rmse_lr}')


Baseline Model RMSE (Linear Regression): 82939.55901469445
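
RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as `SalePrice`. The sketch below computes it by hand on three made-up values and compares the result with the `np.sqrt(mean_squared_error(...))` expression used in the cell above. The helper name `rmse` and the demo numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred):
    """Root mean squared error computed directly from its definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values: the errors are -10, 10, and -30.
y_true_demo = [100.0, 200.0, 300.0]
y_pred_demo = [110.0, 190.0, 330.0]

manual = rmse(y_true_demo, y_pred_demo)
via_sklearn = float(np.sqrt(mean_squared_error(y_true_demo, y_pred_demo)))
print(manual, via_sklearn)
```

Because the errors are squared before averaging, a single large miss (the 30 here) weighs more heavily than several small ones.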



## 5. Advanced Models: Random Forest and Gradient Boosting
Next, we train two tree-based ensemble models, Random Forest and Gradient Boosting, and compare their test RMSE against the baseline.


In [4]:

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Train a Random Forest model
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Make predictions and calculate RMSE for Random Forest
y_pred_rf = rf.predict(X_test)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print(f'Random Forest RMSE: {rmse_rf}')

# Train a Gradient Boosting model
gb = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb.fit(X_train, y_train)

# Make predictions and calculate RMSE for Gradient Boosting
y_pred_gb = gb.predict(X_test)
rmse_gb = np.sqrt(mean_squared_error(y_test, y_pred_gb))
print(f'Gradient Boosting RMSE: {rmse_gb}')


Random Forest RMSE: 30012.142611043513
Gradient Boosting RMSE: 28491.349703843494
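
The comparison above rests on a single train/test split, so part of the gap between models may be due to which rows happened to land in the test set. One way to get a steadier estimate is k-fold cross-validation. The sketch below uses synthetic data from `make_regression` rather than the house-price dataset, and smaller ensembles to keep it quick, so its numbers say nothing about the results above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem used only to demonstrate the procedure.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, random_state=42),
}

cv_rmse = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so RMSE is reported negated; flip the sign back.
    scores = -cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```

The standard deviation across folds gives a rough sense of whether a difference between two models is larger than the fold-to-fold variation.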



## 6. Hyperparameter Tuning
We tune the Gradient Boosting model with GridSearchCV, searching over `n_estimators`, `learning_rate`, and `max_depth` (2 × 2 × 2 = 8 combinations) with 5-fold cross-validation.


In [5]:

from sklearn.model_selection import GridSearchCV

# Define the parameter grid for Gradient Boosting
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5]
}

# Perform GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(GradientBoostingRegressor(), param_grid, cv=5, scoring='neg_mean_squared_error', verbose=1)
grid_search.fit(X_train, y_train)

# Evaluate the best estimator found by the search on the test set
best_gb = grid_search.best_estimator_
y_pred_best_gb = best_gb.predict(X_test)
rmse_best_gb = np.sqrt(mean_squared_error(y_test, y_pred_best_gb))
print(f'Best Gradient Boosting RMSE after tuning: {rmse_best_gb}')


Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best Gradient Boosting RMSE after tuning: 27895.558275340863
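
The cell above reports the test RMSE of the selected model but not which parameter combination was chosen. A fitted `GridSearchCV` exposes this through `best_params_`, and its `best_score_` holds the mean cross-validated score (here a negated MSE). The sketch below shows how to read both, using synthetic data and a smaller, hypothetical grid so that it runs quickly. It also fixes `random_state` on the estimator so repeated runs agree, which the call above does not do.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data and a reduced grid, chosen only to keep the example fast.
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=42)
small_grid = {"n_estimators": [20, 40], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    small_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE across folds; negate it and take the
# square root to express it on the same scale as RMSE.
best_cv_rmse = float(np.sqrt(-search.best_score_))
print(search.best_params_)
print(f"{best_cv_rmse:.2f}")
```

Comparing this cross-validated figure with the test RMSE is a quick check for whether the selected model holds up on data it was not tuned on.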



## 7. Saving the Trained Models
We will save the trained models (Random Forest, Gradient Boosting, and the tuned Gradient Boosting model) to files for future use.


In [6]:

import joblib

# Save the Random Forest, Gradient Boosting, and tuned Gradient Boosting models
joblib.dump(rf, 'random_forest_model.pkl')
joblib.dump(gb, 'gradient_boosting_model.pkl')
joblib.dump(best_gb, 'tuned_gradient_boosting_model.pkl')

print("Models saved as 'random_forest_model.pkl', 'gradient_boosting_model.pkl', and 'tuned_gradient_boosting_model.pkl'.")


Models saved as 'random_forest_model.pkl', 'gradient_boosting_model.pkl', and 'tuned_gradient_boosting_model.pkl'.
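
A saved model can be restored later with `joblib.load` and used for prediction without retraining. The sketch below round-trips a small model through a temporary file and confirms the restored copy gives the same predictions. The synthetic data and the file name `demo_model.pkl` are placeholders, not the files written above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic model used only to demonstrate the save/load round trip.
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=42)
model = GradientBoostingRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
predictions_match = bool(np.allclose(model.predict(X_demo), restored.predict(X_demo)))
print(predictions_match)
```

Pickled models are best loaded with the same scikit-learn version that saved them, and only from sources you trust, since unpickling can execute arbitrary code.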



## 8. Summary and Next Steps
In this notebook, we trained several models and evaluated them on the test set:
- Baseline Linear Regression model: RMSE of about 82,940
- Random Forest model: RMSE of about 30,012
- Gradient Boosting model: RMSE of about 28,491

We also tuned the Gradient Boosting model with GridSearchCV, which lowered its test RMSE to about 27,896, and saved the Random Forest, Gradient Boosting, and tuned Gradient Boosting models for future use.
