<a href="https://colab.research.google.com/github/oluwadunni1/House-Pricing-Prediction-/blob/main/House_Pricing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **House Price Prediction Using Machine Learning**

### Understand the business scenario and problem

This notebook explores the housing dataset through comprehensive Exploratory Data Analysis (EDA) to understand property characteristics, pricing patterns, and key factors influencing house prices, laying the groundwork for building an accurate price prediction model.

### Familiarize yourself with the housing dataset

The dataset contains 50,000 rows and 6 columns:

| Variable         | Description                                      |
| ---------------- | ------------------------------------------------ |
| **SquareFeet**   | Total floor area of the house (in square feet).  |
| **Bedrooms**     | Number of bedrooms in the house.                 |
| **Bathrooms**    | Number of bathrooms in the house.                |
| **Neighborhood** | The area or locality where the house is located. |
| **YearBuilt**    | The year the house was constructed.              |
| **Price**        | The sale price of the house (target variable).   |


### Import Packages

In [1]:
# For data manipulation
import copy
import math
from pathlib import Path

import numpy as np
import pandas as pd

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For preprocessing and modeling
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import SGDRegressor
from sklearn.feature_selection import mutual_info_regression

# For statistical tests
from scipy.stats import (
    levene, mannwhitneyu, chi2_contingency, skew, kurtosis,
    pearsonr, spearmanr, f_oneway,
)
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

### Load Dataset

In [2]:
from google.colab import drive

# This command mounts your Google Drive to the Colab runtime.
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Path to the dataset on Google Drive; adjust it to match your own Drive layout
file_path = '/content/drive/MyDrive/Colab_data/housing_price_dataset.csv'

# Load the dataframe directly from Drive
df = pd.read_csv(file_path)

print(f"Data loaded successfully with {len(df)} rows.")

Data loaded successfully with 50000 rows.


In [5]:
df.head()

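|     | SquareFeet | Bedrooms | Bathrooms | Neighborhood | YearBuilt | Price         |
| --- | ---------- | -------- | --------- | ------------ | --------- | ------------- |
| 0   | 2126       | 4        | 1         | Rural        | 1969      | 215355.283618 |
| 1   | 2459       | 3        | 2         | Rural        | 1980      | 195014.221626 |
| 2   | 1860       | 2        | 1         | Suburb       | 1970      | 306891.012076 |
| 3   | 2294       | 2        | 1         | Urban        | 1996      | 206786.787153 |
| 4   | 2130       | 5        | 2         | Suburb       | 2001      | 272436.239065 |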


### Data Exploration

In [6]:
df.shape

(50000, 6)

In [7]:
# Gather basic information about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   SquareFeet    50000 non-null  int64  
 1   Bedrooms      50000 non-null  int64  
 2   Bathrooms     50000 non-null  int64  
 3   Neighborhood  50000 non-null  object 
 4   YearBuilt     50000 non-null  int64  
 5   Price         50000 non-null  float64
dtypes: float64(1), int64(4), object(1)
memory usage: 2.3+ MB


In [8]:
# Count 'unknown' values across all columns
(df == 'unknown').sum()

| Column       | Count of 'unknown' |
| ------------ | ------------------ |
| SquareFeet   | 0                  |
| Bedrooms     | 0                  |
| Bathrooms    | 0                  |
| Neighborhood | 0                  |
| YearBuilt    | 0                  |
| Price        | 0                  |


In [9]:
# Check for duplicates
df.duplicated().sum()

np.int64(0)

### Target Variable Exploration


In [10]:
# Extract target variable
target = df["Price"]


# 1. Summary Statistics

summary_stats = {
    "Mean": target.mean(),
    "Median": target.median(),
    "Standard Deviation": target.std(),
    "Minimum": target.min(),
    "Maximum": target.max(),
    "Skewness": skew(target),
    "Kurtosis": kurtosis(target)
}

print(" Summary Statistics for Price:\n")
for k, v in summary_stats.items():
    print(f"{k}: {v:.4f}")

 Summary Statistics for Price:

Mean: 224827.3252
Median: 225052.1412
Standard Deviation: 76141.8430
Minimum: -36588.1654
Maximum: 492195.2600
Skewness: -0.0083
Kurtosis: -0.4081


The existence of a negative price is a logical impossibility and indicates a data entry or collection error.

**Action required before testing:**

- Remove invalid data: filter out all rows where Price $\le 0$.

- Re-run statistics: recompute the descriptive statistics on the cleaned data to confirm that every value is positive and that the distribution is essentially unchanged.

In [11]:
# Display the rows with a negative Price
df[df['Price'] < 0]


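|       | SquareFeet | Bedrooms | Bathrooms | Neighborhood | YearBuilt | Price         |
| ----- | ---------- | -------- | --------- | ------------ | --------- | ------------- |
| 1266  | 1024       | 2        | 2         | Urban        | 2006      | -24715.242482 |
| 2310  | 1036       | 4        | 1         | Suburb       | 1983      | -7550.504574  |
| 3630  | 1235       | 3        | 2         | Rural        | 2012      | -19871.251146 |
| 4162  | 1352       | 5        | 2         | Suburb       | 1977      | -10608.359522 |
| 5118  | 1140       | 4        | 1         | Urban        | 2020      | -23911.003119 |
| 5951  | 1097       | 4        | 3         | Rural        | 1981      | -4537.418615  |
| 6355  | 1016       | 5        | 2         | Rural        | 1997      | -13803.684059 |
| 8720  | 1235       | 3        | 1         | Urban        | 1952      | -24183.000515 |
| 9611  | 1131       | 3        | 3         | Urban        | 1959      | -13692.026068 |
| 10597 | 1177       | 2        | 3         | Urban        | 2010      | -434.097124   |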


In [None]:
# Keep only rows with a strictly positive Price, matching the rule stated above (drop Price <= 0)
df = df[df['Price'] > 0].copy()
target = df['Price']      # update target variable



In [None]:
# 1. Summary Statistics

summary_stats = {
    "Mean": target.mean(),
    "Median": target.median(),
    "Standard Deviation": target.std(),
    "Minimum": target.min(),
    "Maximum": target.max(),
    "Skewness": skew(target),
    "Kurtosis": kurtosis(target)
}

print(" Summary Statistics for Price:\n")
for k, v in summary_stats.items():
    print(f"{k}: {v:.4f}")

In [None]:
# Set overall style before creating the figure so it applies to the new axes
sns.set_style("whitegrid")
plt.rcParams['font.family'] = 'sans-serif'

# Create a 1x2 subplot layout
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 1. Histogram + KDE (left plot)
sns.histplot(target, kde=True, bins=50, ax=axes[0], color='#2E86AB',
             edgecolor='white', linewidth=0.5, alpha=0.7)
axes[0].set_title("Distribution of Price", fontsize=14, fontweight='bold', pad=15)
axes[0].set_xlabel("Price", fontsize=12, fontweight='500')
axes[0].set_ylabel("Frequency", fontsize=12, fontweight='500')
axes[0].grid(alpha=0.3, linestyle='--')
axes[0].spines['top'].set_visible(False)
axes[0].spines['right'].set_visible(False)

# 2. Boxplot (right plot)
sns.boxplot(x=target, ax=axes[1], color='#A23B72', width=0.5)
axes[1].set_title("Boxplot of Price", fontsize=14, fontweight='bold', pad=15)
axes[1].set_xlabel("Price", fontsize=12, fontweight='500')
axes[1].grid(alpha=0.3, linestyle='--', axis='x')
axes[1].spines['top'].set_visible(False)
axes[1].spines['right'].set_visible(False)

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()

### Feature Distribution Analysis



In [None]:
features = df.drop(columns=["Price","Neighborhood"]).columns

# Skewness & Kurtosis Table
skew_kurt = {}

for col in features:
    skew_kurt[col] = {
        "Skewness": skew(df[col]),
        "Kurtosis": kurtosis(df[col])
    }

skew_kurt_df = pd.DataFrame(skew_kurt).T
print("\n Skewness and Kurtosis for all Features:\n")
display(skew_kurt_df)



In [None]:
# Histograms + KDE & Boxplots
for col in features:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Histogram + KDE (left plot)
    sns.histplot(df[col], kde=True, bins=50, ax=axes[0], color='#2E86AB',
                 edgecolor='white', linewidth=0.5, alpha=0.7)
    axes[0].set_title(f"Distribution of {col}", pad=15)
    axes[0].set_xlabel(col)
    axes[0].set_ylabel("Frequency")
    axes[0].grid(alpha=0.3, linestyle='--')

    # Boxplot (right plot)
    sns.boxplot(x=df[col], ax=axes[1], color='#2E86AB', width=0.5)
    axes[1].set_title(f"Boxplot: {col}", pad=15)
    axes[1].set_xlabel(col)
    axes[1].grid(alpha=0.3, linestyle='--', axis='x')

    plt.tight_layout()
    plt.show()

The continuous features, SquareFeet and YearBuilt, are both close to uniformly distributed across their ranges (1000-3000 sq. ft. and 1950-2020), so nearly every value is represented by a similar number of properties. With no skew or heavy tails to correct, raw SquareFeet can be used without a transformation, and YearBuilt can be re-expressed as the more interpretable Age.

The count-based features, Bedrooms and Bathrooms, take only a few integer values (1, 2, 3 for Bathrooms; 2, 3, 4, 5 for Bedrooms), so their histograms show separate spikes rather than a continuous shape. One-Hot Encoding them as categorical variables is therefore a reasonable option, because it does not force the same price effect for each additional room.
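
The two transformations mentioned above can be sketched as follows. This is a minimal illustration, not the notebook's final preprocessing: the toy rows reuse the first four records shown by `df.head()`, and `REFERENCE_YEAR = 2023` is an assumed value that should be replaced with the year the data was actually collected.

```python
import pandas as pd

# Toy frame with the same column names as the housing dataset
# (values copied from the first four rows displayed earlier).
toy = pd.DataFrame({
    "SquareFeet": [2126, 2459, 1860, 2294],
    "Bedrooms": [4, 3, 2, 2],
    "Bathrooms": [1, 2, 1, 1],
    "YearBuilt": [1969, 1980, 1970, 1996],
})

REFERENCE_YEAR = 2023  # assumption: not stated anywhere in the dataset

# Age is a simple linear re-expression of YearBuilt.
toy["Age"] = REFERENCE_YEAR - toy["YearBuilt"]

# One-hot encode the count features so each level gets its own indicator column.
encoded = pd.get_dummies(
    toy.drop(columns="YearBuilt"),
    columns=["Bedrooms", "Bathrooms"],
    drop_first=True,  # drop one level per feature to avoid redundant columns
    dtype=int,
)
print(encoded)
```

Dropping the first level keeps the encoded columns from being perfectly collinear with an intercept, which matters for linear models.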


### Bivariate Analysis

In [None]:
# Select only numeric features and drop 'id' if present
numeric_df = df.select_dtypes(include=['number']).drop(columns=['id'], errors='ignore')

# Compute correlation matrix
corr_matrix = numeric_df.corr()

# Plot heatmap
plt.figure(figsize=(14, 8))
sns.heatmap(
    corr_matrix,
    annot=True,          # Show the correlation value in each cell
    cmap="vlag",
    center=0,
    linewidths=0.3
)

plt.title("Correlation Heatmap of Numeric Features", fontsize=16)
plt.show()


In [None]:

# Use a separate name so the `target` Series defined earlier is not overwritten
target_col = "Price"

corr_results = []

for col in numeric_df.columns:
    if col == target_col:
        continue

    # Pearson measures linear association; Spearman measures monotonic (rank) association
    pearson_corr, pearson_p = pearsonr(df[col], df[target_col])
    spearman_corr, spearman_p = spearmanr(df[col], df[target_col])

    corr_results.append({
        "Feature": col,
        "Pearson": pearson_corr,
        "Pearson_p": pearson_p,
        "Spearman": spearman_corr,
        "Spearman_p": spearman_p
    })

corr_df = pd.DataFrame(corr_results)
corr_df.sort_values("Pearson", ascending=False)


SquareFeet is by far the strongest predictor, with a strong linear relationship to Price ($\text{Pearson } r = 0.75$) that is highly statistically significant ($p \ll 0.001$). The Spearman rank correlation ($\rho$), which is better suited to discrete features, confirms that Bedrooms and Bathrooms have a negligible association with Price in their raw numeric form. YearBuilt shows a near-zero correlation ($\text{Pearson } r = -0.0023$) that is not statistically significant ($p = 0.6088$). Converting it to the Age of the House (a reference year minus YearBuilt) is a linear transformation, so it only flips the sign of $r$: Age is easier to interpret, but it carries no additional linear signal on its own.
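
One point worth checking before relying on an Age feature: because Age is a reference year minus YearBuilt, its Pearson correlation with any target should equal that of YearBuilt with the sign flipped. The sketch below illustrates this on synthetic arrays, not the real dataset; the reference year 2023 and the generated values are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Synthetic stand-ins (not the housing data): a year column and an unrelated target.
year_built = rng.integers(1950, 2021, size=1_000)
price = rng.normal(225_000, 76_000, size=1_000)

age = 2023 - year_built  # assumption: 2023 as the reference year

r_year, _ = pearsonr(year_built, price)
r_age, _ = pearsonr(age, price)

# A linear re-expression flips the sign of r but leaves its magnitude unchanged.
print(f"r(YearBuilt, Price) = {r_year:+.4f}")
print(f"r(Age, Price)       = {r_age:+.4f}")
```

If YearBuilt matters at all in this dataset, it is more likely to show up through a non-linear term or an interaction than through this change of units.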

In [None]:
# Select the numeric predictors (exclude the target Price and the categorical Neighborhood)
X = df[[
    'SquareFeet', 'Bedrooms','YearBuilt',
    'Bathrooms'
]]

# Add constant term
X_const = sm.add_constant(X)

# Compute VIF
vif_data = pd.DataFrame({
    "Feature": X.columns,
    "VIF": [variance_inflation_factor(X_const.values, i+1)
            for i in range(len(X.columns))]
})

vif_data
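
For intuition about what the VIF values measure, here is a minimal sketch that computes VIF by hand as $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on the remaining features. It runs on synthetic columns (hypothetical names `a`, `b`, `c`), not the housing data, and the helper `vif_from_r2` is illustrative rather than part of statsmodels.

```python
import numpy as np

def vif_from_r2(X: np.ndarray, j: int) -> float:
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()  # share of variance explained by the other columns
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)               # independent of a
c = a + 0.1 * rng.normal(size=500)     # nearly a copy of a
X_toy = np.column_stack([a, b, c])

vifs = [vif_from_r2(X_toy, j) for j in range(X_toy.shape[1])]
for name, value in zip(["a", "b", "c"], vifs):
    print(f"{name}: VIF = {value:.2f}")
```

Values near 1 indicate little collinearity; the near-duplicate columns `a` and `c` should produce much larger values than `b`.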


### Categorical Features


In [None]:
# Set figure
fig, ax = plt.subplots(figsize=(14, 8))

# Boxplot: Housing Price by Neighborhood
sns.boxplot(
    data=df,
    x='Price',
    y='Neighborhood',
    ax=ax
)

ax.set_title('Housing Price Distribution by Neighborhood', fontsize=14)
ax.set_xlabel('Price')
ax.set_ylabel('Neighborhood')

plt.tight_layout()
plt.show()

In [None]:
# Group Price values by neighborhood (one array per group)
grouped_data = [group["Price"].values for name, group in df.groupby("Neighborhood")]

In [None]:
# Levene's test
stat, p = levene(*grouped_data)

print("Levene's test p-value:", p)
if p < 0.05:
    print("❌ Variances differ significantly across neighborhoods.")
else:
    print("✅ No significant difference in variances detected; the equal-variance assumption is reasonable.")

In [None]:
import scipy.stats as stats

stats.probplot(df['Price'], plot=plt)
plt.show()


In [None]:
from scipy.stats import normaltest

for name, group in df.groupby('Neighborhood'):
    stat, p = normaltest(group['Price'])
    print(f"{name}: p-value = {p:.4f}")


In [None]:
from scipy.stats import kruskal

# Kruskal-Wallis test
stat, p = kruskal(*grouped_data)
print("Kruskal-Wallis p-value:", p)

In [None]:
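# Requires the scikit-posthocs package; if it is not installed, run: !pip install scikit-posthocs
import scikit_posthocs as sp

# Dunn's test with Bonferroni correction for pairwise neighborhood comparisons
sp.posthoc_dunn(df, val_col='Price', group_col='Neighborhood', p_adjust='bonferroni')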