## Exploratory Data Analysis - House Price Prediction
### Introduction
This notebook presents an **exploratory data analysis (EDA)** of a residential housing dataset from Kaggle. The goal of this analysis is to identify key factors that influence house prices and to prepare the dataset for further predictive modeling.

**Dataset:** Housing Price Prediction Data (Kaggle)

**Objective:** Explore and visualize the dataset to gain insights and guide further analysis.

**Author:** NGUYEN Ngoc Dang Nguyen - Final-year Student in Computer Science, Aix-Marseille University

**EDA Steps:**
1. Import Libraries and Load Data
2. Dataset Overview
3. Data Quality Check
4. Missing Values Analysis
5. Target Variable Analysis
6. Categorical Features
7. Numerical Features
8. Feature Relationships
9. Outlier Detection

### 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

df = pd.read_csv("../data/raw/housing_price_dataset.csv")
df.columns = df.columns.str.strip()

print(f"Dataset loaded: {df.shape[0]} rows and {df.shape[1]} columns")
df.head()

### 2. Dataset Overview

In [None]:
print("Basic Info")
df.info()

print("\nDescriptive Statistics")
display(df.describe().round(2))

print("\nCategorical Columns:", df.select_dtypes(include='object').columns.tolist())
print("Numerical Columns:", df.select_dtypes(include=np.number).columns.tolist())

### 3. Data Quality Check

In [None]:
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

if duplicate_count > 0:
    print("Duplicate rows detected")
else:
    print("No duplicate rows detected")

negative_prices = (df['Price'] < 0).sum()
print(f"\nNumber of records with negative price: {negative_prices}")
if negative_prices > 0:
    print("Negative house prices found")
    print(f"Lowest price: {df['Price'].min():.2f}")

print("\nBasic data checks:")
print(f"- Bedrooms range: {df['Bedrooms'].min()} to {df['Bedrooms'].max()}")
print(f"- Bathrooms range: {df['Bathrooms'].min()} to {df['Bathrooms'].max()}")
print(f"- SquareFeet range: {df['SquareFeet'].min()} to {df['SquareFeet'].max()} sq ft")
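The checks above can also be gathered into a small reusable helper. The following is a minimal sketch rather than part of the original notebook: the function name `basic_quality_report` and the toy rows are hypothetical, and only the `Price`, `Bedrooms`, and `SquareFeet` column names are taken from the cells above.

```python
import pandas as pd


def basic_quality_report(frame: pd.DataFrame, price_col: str = "Price") -> dict:
    """Collect simple data-quality counts for a housing-style DataFrame."""
    return {
        "n_rows": len(frame),
        # Rows that exactly repeat an earlier row
        "n_duplicates": int(frame.duplicated().sum()),
        # Prices below zero are not meaningful for a sale price
        "n_negative_prices": int((frame[price_col] < 0).sum()),
    }


# Hypothetical toy data, not taken from the Kaggle file
toy = pd.DataFrame({
    "SquareFeet": [1200, 1500, 1500, 900],
    "Bedrooms": [2, 3, 3, 1],
    "Price": [210000.0, 305000.0, 305000.0, -1500.0],
})

report = basic_quality_report(toy)
print(report)
```

Returning a dictionary keeps the counts easy to log or compare before and after cleaning.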

### 4. Missing Values Analysis

In [None]:
missing = df.isnull().sum()
if missing.sum() > 0:
    plt.figure(figsize=(8, 4))
    sns.barplot(x=missing[missing > 0].index, y=missing[missing > 0].values, color='salmon')
    plt.title("Missing Values by Feature")
    plt.ylabel("Count")
    plt.xticks(rotation=45)
    plt.show()
else:
    print("No missing values detected.")
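When missing values are present, a table of counts and percentages is often easier to act on than a bar chart alone. The sketch below is illustrative only: `missing_summary` is a hypothetical helper, and the toy data (including the `Neighborhood` column) is made up for the example.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with gaps."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(1),
    })
    # Keep only affected columns, worst first
    affected = summary[summary["missing_count"] > 0]
    return affected.sort_values("missing_count", ascending=False)


# Hypothetical toy data with a few gaps
toy = pd.DataFrame({
    "SquareFeet": [1200, np.nan, 1500, 900],
    "Neighborhood": ["Urban", None, None, "Rural"],
    "Price": [210000.0, 305000.0, 280000.0, 150000.0],
})

summary = missing_summary(toy)
print(summary)
```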

### 5. Target Variable Analysis

In [None]:
target = 'Price'

print(f"Average price: {df[target].mean():.2f}")
print(f"Median price: {df[target].median():.2f}")
print(f"Price range: {df[target].min():.2f} to {df[target].max():.2f}")
print(f"Standard deviation: {df[target].std():.2f}")

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.histplot(df[target], bins=25, kde=True, color='skyblue')
plt.title(f"{target} Distribution")

plt.subplot(1, 2, 2)
sns.boxplot(y=df[target], color='lightcoral')
plt.title(f"{target} Boxplot")

plt.tight_layout()
plt.show()
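The histogram and boxplot give a visual impression of the target's shape. One possible numeric complement is to compare the mean with the median and compute the sample skewness. This is a sketch under assumptions: `describe_shape` is a hypothetical helper and the values are toy numbers, not results from the dataset.

```python
import pandas as pd


def describe_shape(values: pd.Series) -> dict:
    """Summarize the center and asymmetry of a numeric series."""
    return {
        "mean": float(values.mean()),
        "median": float(values.median()),
        # Positive skew suggests a longer right tail
        "skew": float(values.skew()),
    }


# Hypothetical right-skewed toy prices
toy_prices = pd.Series([100.0, 110.0, 120.0, 130.0, 500.0])

stats = describe_shape(toy_prices)
print(stats)
```

A mean well above the median, together with positive skew, would hint that a transformation such as a logarithm may be worth considering before modeling.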

### 6. Categorical Features

In [None]:
cat_cols = df.select_dtypes(include='object').columns
for col in cat_cols:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=col, y=target, data=df)
    plt.title(f"{target} by {col}")
    plt.xticks(rotation=30)
    plt.tight_layout()
    plt.show()
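The boxplots can be paired with a per-category summary table so that group sizes and central values are stated explicitly. The following is a minimal sketch: `price_by_category` is a hypothetical helper, and the `Neighborhood` column and its values are assumed for illustration.

```python
import pandas as pd


def price_by_category(frame: pd.DataFrame, category_col: str,
                      target_col: str = "Price") -> pd.DataFrame:
    """Return count, median, and mean of the target for each category."""
    summary = frame.groupby(category_col)[target_col].agg(["count", "median", "mean"])
    # Highest median first makes group differences easy to scan
    return summary.sort_values("median", ascending=False)


# Hypothetical toy data
toy = pd.DataFrame({
    "Neighborhood": ["Urban", "Urban", "Rural", "Rural", "Rural"],
    "Price": [300.0, 320.0, 150.0, 170.0, 160.0],
})

summary = price_by_category(toy, "Neighborhood")
print(summary)
```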

### 7. Numerical Features

In [None]:
num_cols = df.select_dtypes(include=np.number).drop(columns=[target], errors='ignore')
if num_cols.shape[1] == 0:
    print("No numerical features to plot.")
else:
    num_cols.hist(bins=20, figsize=(12, 8), color='skyblue', edgecolor='black')
    plt.suptitle("Distributions of Numerical Features")
    plt.tight_layout()
    plt.show()

### 8. Feature Relationships

In [None]:
corr = df.corr(numeric_only=True)
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Matrix")
plt.show()

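print(f"Top correlations with {target}:")
# Drop the target's self-correlation, then rank by absolute strength
# so that strong negative relationships are not overlooked.
print(corr[target].drop(target).sort_values(key=abs, ascending=False).head(3))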
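Ranking features by their correlation with the target is easier to reuse when wrapped in a function. The sketch below ranks by absolute correlation so that negative relationships are included. It is not the notebook's own code: `rank_correlations` is a hypothetical helper, and the toy columns `Age` and `Noise` are assumptions.

```python
import pandas as pd


def rank_correlations(frame: pd.DataFrame, target_col: str) -> pd.Series:
    """Rank numeric features by absolute Pearson correlation with the target."""
    # Exclude the target's correlation with itself
    with_target = frame.corr(numeric_only=True)[target_col].drop(target_col)
    order = with_target.abs().sort_values(ascending=False).index
    # Return the signed values in order of absolute strength
    return with_target.loc[order]


# Hypothetical toy data
toy = pd.DataFrame({
    "SquareFeet": [1000, 1500, 2000, 2500],
    "Age": [40, 30, 20, 15],
    "Noise": [1, -1, -1, 1],
    "Price": [100, 150, 200, 250],
})

ranked = rank_correlations(toy, "Price")
print(ranked)
```

Keeping the sign in the output shows the direction of each relationship while the ordering reflects its strength. Pearson correlation only captures linear association, so a low value does not rule out a nonlinear effect.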

### 9. Outlier Detection

In [None]:
Q1, Q3 = df[target].quantile([0.25, 0.75])
IQR = Q3 - Q1
low, high = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

plt.figure(figsize=(10, 4))
sns.scatterplot(x=range(len(df)), y=df[target], alpha=0.7)
plt.axhline(low, color='r', linestyle='--', label='Lower Bound')
plt.axhline(high, color='r', linestyle='--', label='Upper Bound')
plt.title("Price Outlier Detection")
plt.legend()
plt.show()

outliers = df[(df[target] < low) | (df[target] > high)]
print(f"Number of outliers: {len(outliers)} / {len(df)} ({len(outliers)/len(df)*100:.1f}%)")
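The bounds above follow the common IQR rule, in which a value is flagged when it lies more than 1.5 times the interquartile range below the first quartile or above the third. A minimal self-contained sketch of the same rule is shown here. The function name `iqr_bounds`, the multiplier parameter `k`, and the toy prices are assumptions for illustration.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) bounds using the IQR rule with multiplier k."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical toy prices with one extreme value
prices = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0,
                    150.0, 160.0, 170.0, 1000.0])

low, high = iqr_bounds(prices)
flagged = prices[(prices < low) | (prices > high)]
print(f"Bounds: {low} to {high}")
print(f"Flagged values: {flagged.tolist()}")
```

Flagged values are candidates for review, not necessarily errors. Whether to keep, cap, or remove them depends on the modeling goal.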

### Conclusion
This exploratory analysis examined the dataset's structure, distributions, and key relationships. The checks for duplicates, negative prices, missing values, and outliers surface potential data quality issues, while the correlation matrix indicates which numerical features are most strongly associated with price. These findings inform data cleaning, feature engineering, and model development in the subsequent steps of the pipeline.