<h3>1. Introduction</h3>
In this project, I predict house sale prices in the Ames housing dataset using several regression techniques. The dataset contains a broad range of features describing each property, such as neighborhood, lot area, basement condition, and garage type. I compare Linear Regression, Polynomial Regression, and two ensemble models (Random Forest and Gradient Boosting) to determine which approach best captures the underlying patterns in the data.

<h4>Import Libraries</h4>

In [295]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

<h3>2. Data Exploration and Preprocessing</h3>
<h4>Data Exploration</h4>
I started by loading the Ames housing dataset and examining its structure. This dataset includes 2,930 rows and 82 columns, consisting of both numerical and categorical features that describe various characteristics of each property, including the neighborhood, lot area, and conditions of different parts of the house (like basements, garages, and kitchens).

Key data exploration steps included:

 - Checking Missing Values: Many features, such as Lot Frontage, Alley, and Garage Type, contained missing values. Some features, like Pool QC (pool quality), had very few non-null values, suggesting that they apply to only a small share of houses. I dropped Alley, Pool QC, Fence, and Misc Feature before modeling.
 - Understanding Data Types: Features were categorized as numerical (e.g., Lot Area, SalePrice) or categorical (e.g., Neighborhood, House Style).
 - Summary Statistics: Using .describe(), I obtained insights into distributions, ranges, and potential outliers, which informed decisions on scaling and potential outlier handling.

After the initial exploration, I determined that several preprocessing steps were needed to make the data suitable for regression modeling.
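
The missing-value check described above can be sketched as a small helper. This is an illustrative sketch rather than the notebook's original code: the function name missing_summary and the toy frame are assumptions, and in the notebook the helper would be called on data instead.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "share_missing": counts / len(df),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)


# Toy frame standing in for the Ames data (values are made up)
toy = pd.DataFrame({
    "Lot Frontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "Lot Area": [8450, 9600, 11250, 9550],
})
report = missing_summary(toy)
print(report)
```

Sorting by the number of gaps makes it easy to separate columns that are almost empty from those that only need a few values imputed.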

In [298]:
data = pd.read_csv("AmesHousing.csv")
data.head()
data.info()  # This provides an overview of data types and missing values
data.describe()  # Summary statistics of numerical columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order            2930 non-null   int64  
 1   PID              2930 non-null   int64  
 2   MS SubClass      2930 non-null   int64  
 3   MS Zoning        2930 non-null   object 
 4   Lot Frontage     2440 non-null   float64
 5   Lot Area         2930 non-null   int64  
 6   Street           2930 non-null   object 
 7   Alley            198 non-null    object 
 8   Lot Shape        2930 non-null   object 
 9   Land Contour     2930 non-null   object 
 10  Utilities        2930 non-null   object 
 11  Lot Config       2930 non-null   object 
 12  Land Slope       2930 non-null   object 
 13  Neighborhood     2930 non-null   object 
 14  Condition 1      2930 non-null   object 
 15  Condition 2      2930 non-null   object 
 16  Bldg Type        2930 non-null   object 
 17  House Style   

Unnamed: 0,Order,PID,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,...,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,SalePrice
count,2930.0,2930.0,2930.0,2440.0,2930.0,2930.0,2930.0,2930.0,2930.0,2907.0,...,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0
mean,1465.5,714464500.0,57.387372,69.22459,10147.921843,6.094881,5.56314,1971.356314,1984.266553,101.896801,...,93.751877,47.533447,23.011604,2.592491,16.002048,2.243345,50.635154,6.216041,2007.790444,180796.060068
std,845.96247,188730800.0,42.638025,23.365335,7880.017759,1.411026,1.111537,30.245361,20.860286,179.112611,...,126.361562,67.4834,64.139059,25.141331,56.08737,35.597181,566.344288,2.714492,1.316613,79886.692357
min,1.0,526301100.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,12789.0
25%,733.25,528477000.0,20.0,58.0,7440.25,5.0,5.0,1954.0,1965.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0,129500.0
50%,1465.5,535453600.0,50.0,68.0,9436.5,6.0,5.0,1973.0,1993.0,0.0,...,0.0,27.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,160000.0
75%,2197.75,907181100.0,70.0,80.0,11555.25,7.0,6.0,2001.0,2004.0,164.0,...,168.0,70.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,213500.0
max,2930.0,1007100000.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,...,1424.0,742.0,1012.0,508.0,576.0,800.0,17000.0,12.0,2010.0,755000.0


In [300]:
data.columns = data.columns.str.strip()
data = data.drop(columns=['Alley', 'Pool QC', 'Fence', 'Misc Feature'])
data.isnull().sum()


Order               0
PID                 0
MS SubClass         0
MS Zoning           0
Lot Frontage      490
                 ... 
Mo Sold             0
Yr Sold             0
Sale Type           0
Sale Condition      0
SalePrice           0
Length: 78, dtype: int64

<h4>Data Preprocessing</h4>
To prepare the data, I applied the following transformations:

1. Handling Missing Values:
   - Numerical Features: For numerical columns with moderate amounts of missing data (e.g., Lot Frontage, Mas Vnr Area), I filled missing values with the median, as this approach minimizes the impact of extreme values.
   - Basement and Garage Features: Features related to the basement (BsmtFin SF 1, Bsmt Full Bath, etc.) and garage (Garage Yr Blt, Garage Cars, etc.) were set to 0 where a missing value indicated the absence of that feature. This lets the model distinguish between homes with and without these amenities. Note that for Garage Yr Blt, 0 is a sentinel value rather than a real year.
   - Categorical Features: Missing values in selected categorical columns (e.g., Garage Type, Bsmt Qual) were filled with "Unknown" to mark the lack of information, so that no rows had to be dropped because of missing values.

2. Encoding Categorical Variables:

   - I applied One-Hot Encoding (pd.get_dummies) to convert categorical variables into binary indicator columns, one per category. The drop_first=True option drops one level per feature, which avoids the perfect multicollinearity (the "dummy variable trap") that a full set of indicator columns would introduce.

3. Feature Scaling:

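   - Using StandardScaler, I standardized numerical columns to have a mean of 0 and a standard deviation of 1. Scaling puts features on a comparable footing, which matters for Linear and Polynomial Regression; tree-based models are largely insensitive to it. Because the scaler was applied to every int64/float64 column before the target was separated, SalePrice was standardized as well, so the error metrics reported later are in standard deviations of SalePrice rather than dollars. The scaler was also fit on the full dataset before the split, so the test rows influence the scaling statistics.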

4. Splitting Data into Training and Testing Sets:

   - Finally, I split the dataset into training and testing sets, with 80% of the data allocated for training and 20% for testing. This split enables model evaluation on unseen data to assess generalizability.
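
The four steps above can also be arranged so that imputation and scaling statistics are learned from the training rows only. The following is a minimal sketch under stated assumptions, not the code used in this notebook: the toy frame and its values are made up, the target is left unscaled, and the encoder keeps all levels (handle_unknown="ignore") instead of using drop_first=True.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data standing in for the Ames frame (values are made up)
toy = pd.DataFrame({
    "Lot Frontage": [65.0, np.nan, 80.0, 60.0, 70.0, np.nan, 85.0, 75.0],
    "Lot Area": [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382],
    "Garage Type": ["Attchd", np.nan, "Detchd", "Attchd",
                    np.nan, "Attchd", "Detchd", "Attchd"],
    "SalePrice": [208500, 181500, 223500, 140000,
                  250000, 143000, 307000, 200000],
})
X = toy.drop(columns=["SalePrice"])
y = toy["SalePrice"]  # the target stays in dollars

numeric_cols = ["Lot Frontage", "Lot Area"]
categorical_cols = ["Garage Type"]

preprocess = ColumnTransformer(
    [
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Split first, then fit the preprocessing on the training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)  # reuses the training statistics
print(X_train_prep.shape, X_test_prep.shape)
```

Fitting on the training split alone keeps the test set untouched until evaluation, at the cost of slightly different medians and scaling constants than a fit on the full dataset.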


In [303]:
# Fill Lot Frontage and Mas Vnr Area with their median values
data['Lot Frontage'] = data['Lot Frontage'].fillna(data['Lot Frontage'].median())
data['Mas Vnr Area'] = data['Mas Vnr Area'].fillna(data['Mas Vnr Area'].median())

# Fill Garage-related columns with 0 where there is likely no garage
data['Garage Yr Blt'] = data['Garage Yr Blt'].fillna(0)
data['Garage Cars'] = data['Garage Cars'].fillna(0)
data['Garage Area'] = data['Garage Area'].fillna(0)

# Fill Basement-related columns with 0 where there is likely no basement
basement_cols = ['BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 
                 'Total Bsmt SF', 'Bsmt Full Bath', 'Bsmt Half Bath']
data[basement_cols] = data[basement_cols].fillna(0)

# Fill categorical columns with 'Unknown'
categorical_cols_with_missing = ['Garage Type', 'Garage Finish', 'Garage Qual', 'Garage Cond', 'Bsmt Qual', 
                                 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2', 'Electrical']

for col in categorical_cols_with_missing:
    data[col] = data[col].fillna("Unknown")



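# Sanity check on the final feature matrices.
# Note: X_train and X_test are created in the train/test split cell further down,
# so this check only works after that cell has been run.
print("Missing values in X_train:", X_train.isnull().sum().sum())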
print("Missing values in X_test:", X_test.isnull().sum().sum())

print("Columns with missing values in X_train:\n", X_train.isnull().sum()[X_train.isnull().sum() > 0])
print("Columns with missing values in X_test:\n", X_test.isnull().sum()[X_test.isnull().sum() > 0])


Missing values in X_train: 0
Missing values in X_test: 0
Columns with missing values in X_train:
 Series([], dtype: int64)
Columns with missing values in X_test:
 Series([], dtype: int64)


In [305]:
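# One-hot encode all categorical (object) columns, dropping the first level of each
data = pd.get_dummies(data, drop_first=True)

# Select the int64/float64 columns. This includes the target SalePrice and the
# identifier columns Order and PID, so all of them are standardized below.
numerical_cols = data.select_dtypes(include=['float64', 'int64']).columns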
scaler = StandardScaler()
data[numerical_cols] = scaler.fit_transform(data[numerical_cols])

In [307]:

X = data.drop(columns=['SalePrice'])  # Features
y = data['SalePrice']  # Target variable


# Split data into 80% training and 20% testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


<h3>3. Model Implementation</h3>
I implemented four regression models to predict house prices. For each model, I give a brief description, the reason it was chosen, and the code used to train and evaluate it:

<h3>Linear Regression</h3> 
Linear Regression is a straightforward model that assumes a linear relationship between features and the target variable. I used it as a baseline model to see how a simple, linear approach performs on this dataset.

In [311]:
# Initialize the model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Predict on the test set
y_pred_lr = linear_model.predict(X_test)

print("Linear Regression Performance:")
print("MAE:", mean_absolute_error(y_test, y_pred_lr))
print("MSE:", mean_squared_error(y_test, y_pred_lr))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lr)))
print("R² Score:", r2_score(y_test, y_pred_lr))


Linear Regression Performance:
MAE: 826599.6214399436
MSE: 400296536895215.4
RMSE: 20007412.048918653
R² Score: -318523074565158.25
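
Errors this large on a standardized target often point to an ill-conditioned design matrix rather than a merely poor fit. One possible check, sketched here on a made-up matrix (not the notebook's X_train), is to compare the matrix rank with the number of columns and to look at the condition number. Whether this is what happened in the run above is an assumption that would need to be verified on the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up design matrix: 50 rows, 3 independent columns
base = rng.normal(size=(50, 3))

# Append a column that is an exact linear combination of two others,
# mimicking redundant one-hot or derived features
redundant = base[:, 0] + base[:, 1]
X_demo = np.column_stack([base, redundant])

rank = np.linalg.matrix_rank(X_demo)
n_cols = X_demo.shape[1]
cond = np.linalg.cond(X_demo)

print("columns:", n_cols, "rank:", rank)
print("condition number:", cond)
# rank < n_cols means ordinary least squares has no unique solution,
# and a very large condition number signals unstable coefficients
```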


<h3>Polynomial Regression</h3>
Polynomial Regression allows us to capture non-linear relationships by adding polynomial terms (e.g., squares and interactions of features). I used a degree-2 polynomial to see if introducing non-linearity would improve performance.
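
To get a feel for how quickly a degree-2 expansion grows, the number of generated columns can be computed directly. This is a small sketch: the helper name n_poly_terms is made up, and the figure of 250 input columns is hypothetical, not the actual width of X_train.

```python
from math import comb


def n_poly_terms(n_features: int, degree: int = 2) -> int:
    """Number of columns in a full polynomial expansion up to `degree`,
    including the constant (bias) term."""
    return comb(n_features + degree, degree)


small = n_poly_terms(3)    # 1 bias + 3 linear + 3 squares + 3 interactions
wide = n_poly_terms(250)   # hypothetical 250 input columns
print(small, wide)
```

With an 80/20 split of 2,930 rows, the training set has 2,344 rows, so an expansion with more columns than that leaves the least-squares fit underdetermined.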

In [314]:
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
y_pred_poly = poly_model.predict(X_test_poly)


print("\nPolynomial Regression Performance:")
print("MAE:", mean_absolute_error(y_test, y_pred_poly))
print("MSE:", mean_squared_error(y_test, y_pred_poly))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_poly)))
print("R² Score:", r2_score(y_test, y_pred_poly))



Polynomial Regression Performance:
MAE: 0.32850972574011694
MSE: 0.2976051863808963
RMSE: 0.5455320214074479
R² Score: 0.7631902596314633


<h3>Random Forest Regressor</h3>
Random Forest is an ensemble model that trains many decision trees on bootstrap samples of the data and averages their predictions. This makes it more robust to overfitting than a single decision tree and lets it capture non-linear relationships.

In [316]:
# Initialize and train the Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = rf_model.predict(X_test)

print("\nRandom Forest Regression Performance:")
print("MAE:", mean_absolute_error(y_test, y_pred_rf))
print("MSE:", mean_squared_error(y_test, y_pred_rf))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_rf)))
print("R² Score:", r2_score(y_test, y_pred_rf))



Random Forest Regression Performance:
MAE: 0.1987152005718906
MSE: 0.10830290975308667
RMSE: 0.3290940743208341
R² Score: 0.9138214483031205


<h3>Gradient Boosting Regressor</h3>
Gradient Boosting is another ensemble method. It builds trees sequentially, with each new tree fitted to correct the errors of the ensemble built so far. This makes it effective at capturing complex patterns, and it often achieves high accuracy on tabular data.
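
The sequential nature of boosting can be made visible by tracking the test error after each added tree. The sketch below uses synthetic data from make_regression as a stand-in for the housing features, so the numbers it prints say nothing about the Ames results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the housing features
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

gb = GradientBoostingRegressor(n_estimators=100, random_state=0)
gb.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 100 trees
test_mse = [mean_squared_error(y_te, pred) for pred in gb.staged_predict(X_te)]

best_n = int(np.argmin(test_mse)) + 1
print("MSE after 1 tree:", test_mse[0])
print("MSE after all trees:", test_mse[-1])
print("lowest test MSE at", best_n, "trees")
```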

In [318]:
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)

print("\nGradient Boosting Regression Performance:")
print("MAE:", mean_absolute_error(y_test, y_pred_gb))
print("MSE:", mean_squared_error(y_test, y_pred_gb))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_gb)))
print("R² Score:", r2_score(y_test, y_pred_gb))



Gradient Boosting Regression Performance:
MAE: 0.1901368392586956
MSE: 0.0971957414359103
RMSE: 0.311762315612247
R² Score: 0.9226596196986074


<h3>4. Model Evaluation and Comparison</h3>

To evaluate each model, I used the following metrics:
 - Mean Absolute Error (MAE): Average absolute difference between predicted and actual prices.
 - Mean Squared Error (MSE): Average squared difference between predicted and actual prices, penalizing larger errors.
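 - Root Mean Squared Error (RMSE): Square root of MSE, expressed in the same units as the target. Because SalePrice was standardized together with the other numerical columns, the MAE and RMSE values reported here are in standard deviations of SalePrice, not dollars.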
 - R² Score: Indicates how well the model explains the variance in the target variable.
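
For reference, the four metrics can be written out from their definitions. This is a plain NumPy sketch with a made-up example; the notebook itself uses the scikit-learn functions, and the helper name regression_metrics is an assumption.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R² from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Small made-up example
m = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 11.0])
print(m)
```

The R² formula also shows why the score can be negative: whenever the residual sum of squares exceeds the total sum of squares, the model does worse than always predicting the mean.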

<h4>Performance Summary:</h4>

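 - Linear Regression: This model performed very poorly, with enormous MAE, MSE, and RMSE values and an R² of roughly -3.2 × 10^14. Errors of this size on a standardized target suggest a numerically unstable fit (for example, collinear or rarely populated one-hot columns producing extreme coefficients for a few test rows) more than non-linearity alone, so this baseline is probably not a fair measure of what a linear model can achieve on this data.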

 - Polynomial Regression: By adding polynomial terms, this model showed a significant improvement, with lower error metrics and an R² score of 0.76. However, it still left some variance unexplained, suggesting that a non-parametric model might perform better.

 - Random Forest Regression: The ensemble approach of Random Forest yielded much better performance, with a high R² score of 0.91 and low MAE and RMSE values. This model captured non-linear relationships well and handled complex interactions between features effectively.

 - Gradient Boosting Regression: Gradient Boosting had the lowest error metrics and the highest R² score (0.92), narrowly ahead of Random Forest (0.91) on this single train/test split.
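
Because SalePrice was standardized together with the other numerical columns (the scaler ran before the target was split off), the errors above can be translated back into dollars by multiplying by the standard deviation of SalePrice. The sketch below uses the figures printed earlier in this notebook; the only assumption is the conversion from the sample standard deviation shown by .describe() to the population standard deviation that StandardScaler uses.

```python
import math

# Figures copied from the outputs earlier in this notebook
n_rows = 2930
saleprice_std_sample = 79886.692357  # pandas .describe() std (ddof=1)
rmse_standardized = {
    "Random Forest": 0.3290940743208341,
    "Gradient Boosting": 0.311762315612247,
}

# StandardScaler divides by the population std (ddof=0), so convert first
saleprice_std_population = saleprice_std_sample * math.sqrt((n_rows - 1) / n_rows)

rmse_dollars = {
    name: value * saleprice_std_population
    for name, value in rmse_standardized.items()
}
for name, value in rmse_dollars.items():
    print(f"{name}: RMSE of about ${value:,.0f}")
```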

<h4>Chosen Model:</h4> 
Based on these evaluations, Gradient Boosting Regression is the best model for predicting house prices in this dataset due to its superior accuracy and ability to handle complex patterns in the data.
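
Since the two ensemble models are close, a k-fold cross-validation would give a steadier comparison than a single split. The following is a sketch on synthetic data, not a result from this project; the number of folds and the reduced n_estimators=50 are arbitrary choices to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the prepared housing features
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=15.0, random_state=0
)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, random_state=42),
}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for name, model in models.items():
    # One R² score per fold; the spread shows how stable the ranking is
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    cv_results[name] = scores
    print(f"{name}: mean R² = {scores.mean():.3f}, std = {scores.std():.3f}")
```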

<h3>5. Conclusion</h3>
In this project, I explored several regression techniques to predict house prices in Ames and found that:

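 - Handling non-linear relationships matters: the tree-based ensembles clearly outperformed the linear and polynomial fits, although the Linear Regression result also appears to have been affected by numerical instability, given its extreme error values.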
 - Ensemble methods (Random Forest and Gradient Boosting) provided the best performance, with Gradient Boosting emerging as the top model.