## Data Preprocessing - House Price Dataset

### Introduction
This notebook performs **data preprocessing** for the residential housing dataset analyzed in `01_exploratory_data_analysis.ipynb`. Based on the insights from the exploratory phase, we will clean and transform the raw dataset to make it suitable for feature engineering and predictive modeling.

**Dataset:** Housing Price Prediction Data (Kaggle)

**Objective:** Produce a clean dataset that can be directly used in the feature engineering stage.

**Author:** NGUYEN Ngoc Dang Nguyen – Final-year Student in Computer Science, Aix-Marseille University

**Preprocessing Steps:**
1. Import Libraries and Load Data
2. Remove Negative Prices
3. Encode Categorical Variable
4. Save Processed Data

### 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/raw/housing_price_dataset.csv")
# Strip stray whitespace from column names so later lookups such as df['Price'] work
df.columns = df.columns.str.strip()

print(f"Dataset loaded: {df.shape[0]} rows and {df.shape[1]} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

df.head()
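
As a quick sanity check after loading, it can help to confirm that the columns used later in this notebook are present once the header whitespace has been stripped. The sketch below uses a small in-memory CSV with made-up values instead of the real file, and the `required` set is an assumption based only on the two columns this notebook references (`Price` and `Neighborhood`).

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw CSV; note the padded header names
raw_csv = io.StringIO(" Price ,Neighborhood \n250000,Rural\n-1200,Urban\n")

toy = pd.read_csv(raw_csv)
toy.columns = toy.columns.str.strip()  # same cleanup as in the cell above

# Columns this notebook relies on (assumed from the later cells)
required = {"Price", "Neighborhood"}
missing = required - set(toy.columns)

if missing:
    raise KeyError(f"Missing expected columns: {sorted(missing)}")
print(f"All required columns found: {sorted(required)}")
```

Failing early here gives a clearer error than a `KeyError` raised halfway through the preprocessing steps.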

### 2. Remove Negative Prices

In [None]:
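n_before = df.shape[0]
print(f"Original dataset: {n_before} rows")

# Keep only non-negative prices; rows with a missing Price also fail this comparison and are dropped
df = df[df['Price'] >= 0].copy()

print(f"After removing negative prices: {df.shape[0]} rows")
print(f"Removed: {n_before - df.shape[0]} rows")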
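
The filter `df['Price'] >= 0` removes more than negative values: any row whose price is missing is dropped too, because a comparison with `NaN` evaluates to `False`. The sketch below illustrates this on a hypothetical four-row frame (the values are invented, not taken from the housing dataset) and counts the two cases separately so the "removed" figure can be attributed correctly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one valid price, one negative, one zero, one missing
toy = pd.DataFrame({"Price": [250000.0, -1200.0, 0.0, np.nan]})

n_before = len(toy)
n_negative = int((toy["Price"] < 0).sum())   # rows that are actually negative
n_missing = int(toy["Price"].isna().sum())   # rows with no price at all

# Same filter as in the cell above; NaN >= 0 is False, so NaN rows are dropped as well
toy_clean = toy[toy["Price"] >= 0].copy()
n_removed = n_before - len(toy_clean)

print(f"Removed {n_removed} rows: {n_negative} negative, {n_missing} missing")
```

If the raw data contains no missing prices, the two counts agree and the filter behaves exactly as the section title suggests.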

### 3. Encode Categorical Variable

In [None]:
# One-hot encode Neighborhood; drop_first=False keeps one indicator column per category
df_encoded = pd.get_dummies(df, columns=['Neighborhood'], drop_first=False)
print(f"Before encoding: {df.shape[1]} columns")
print(f"After encoding: {df_encoded.shape[1]} columns")
print(f"\nNew columns created: {[col for col in df_encoded.columns if 'Neighborhood' in col]}")
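
To make the effect of the encoding concrete, here is a minimal sketch on a hypothetical three-row frame. The category labels (`Rural`, `Suburb`, `Urban`) and the `SquareFeet` column are assumptions for illustration only. The `dtype=int` argument is an optional variation, not what the cell above does: it produces 0/1 integers, whereas recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Hypothetical toy data; labels and the SquareFeet column are illustrative assumptions
toy = pd.DataFrame({
    "SquareFeet": [1500, 2100, 1800],
    "Neighborhood": ["Rural", "Suburb", "Urban"],
})

# dtype=int is an optional variation that yields 0/1 instead of True/False
encoded = pd.get_dummies(toy, columns=["Neighborhood"], drop_first=False, dtype=int)

dummy_cols = [c for c in encoded.columns if c.startswith("Neighborhood_")]

# Each row should have exactly one active indicator
one_hot_ok = bool((encoded[dummy_cols].sum(axis=1) == 1).all())

print(dummy_cols)
print(f"Exactly one indicator set per row: {one_hot_ok}")
```

With `drop_first=False` every category keeps its own column, so the indicators always sum to 1. Linear models with an intercept see that as perfect collinearity, while tree-based models are unaffected, so whether to drop one column is a decision for the modeling stage.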

### 4. Save Processed Data

In [None]:
df_encoded.to_csv('../data/processed/cleaned_data.csv', index=False)

print(f"Cleaned dataset saved: {df_encoded.shape[0]} rows, {df_encoded.shape[1]} columns")
print("File: ../data/processed/cleaned_data.csv")
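
`to_csv` does not create missing directories, so the save step fails if `../data/processed/` does not exist yet. The sketch below shows one way to guard against that and to verify the file by reading it back. It writes a hypothetical two-row frame into a temporary directory so it can run anywhere; in the notebook the path would be the one used above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for df_encoded
toy = pd.DataFrame({"Price": [250000.0, 0.0], "Neighborhood_Rural": [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed" / "cleaned_data.csv"

    # Create the target folder if needed; exist_ok avoids an error on re-runs
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip preserved the layout
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(f"Round trip OK: {same_shape and same_columns}")
```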

### Conclusion

The dataset has been successfully preprocessed and prepared for feature engineering. Starting from 50,000 records, we removed 22 invalid entries with negative prices, resulting in 49,978 clean samples. The categorical variable `Neighborhood` was one-hot encoded into 3 binary features. The cleaned dataset was saved for use in the feature engineering stage.