# Feature Engineering - House Price Dataset
## Introduction
This notebook focuses on **feature engineering** for the house price prediction project. Using the cleaned dataset from `02_data_preprocessing.ipynb`, we will create new meaningful features that can help improve model performance.

**Dataset:** Housing Price Prediction Data (Kaggle)

**Objective:** Create and select new features to improve model performance and prepare the dataset for modeling.

**Author:** NGUYEN Ngoc Dang Nguyen - Final-year Student in Computer Science, Aix-Marseille University

**Feature Engineering Steps:**
1. Import Libraries and Load Data
2. Feature Creation
3. Feature Transformation
4. Feature Selection
5. Final Feature Set

### 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("../data/processed/cleaned_data.csv")

print(f"Cleaned dataset loaded: {df.shape[0]} rows and {df.shape[1]} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

df.head()
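
Before deriving features, it can help to confirm that the columns used in the later steps are present. The following is a minimal sketch and not part of the original notebook: the helper `check_required_columns` and the toy frame are hypothetical, and the column list is simply the set of names this notebook refers to.

```python
import pandas as pd

# Columns referenced by the later cells of this notebook
REQUIRED_COLUMNS = ["SquareFeet", "Bedrooms", "Bathrooms", "YearBuilt", "Price"]


def check_required_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required columns that are missing from `frame`."""
    return [col for col in required if col not in frame.columns]


# Toy data standing in for the cleaned dataset (illustrative values only)
toy = pd.DataFrame({
    "SquareFeet": [1500, 2200],
    "Bedrooms": [3, 4],
    "Bathrooms": [2, 3],
    "Price": [250000, 340000],
})

missing = check_required_columns(toy)
print(f"Missing columns: {missing}")
```

On the real data, the same call would be `check_required_columns(df)`, and an empty list means every expected column is available.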

### 2. Feature Creation

In [None]:
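df['House_Age'] = 2025 - df['YearBuilt']  # age in years, relative to 2025
df['Price_per_SqFt'] = df['Price'] / df['SquareFeet']  # derived from the target: exclude from model inputs to avoid leakage
df['Total_Rooms'] = df['Bedrooms'] + df['Bathrooms']  # bedrooms plus bathrooms
df['Room_per_SqFt'] = df['Total_Rooms'] / df['SquareFeet']  # room density relative to living area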

df.head()
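
The ratio features above divide by SquareFeet, so a zero area would produce infinite values. The sketch below shows one possible way to guard against that; it is an illustration under assumptions, not the notebook's method. The function name `add_ratio_features`, the `reference_year` parameter, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def add_ratio_features(frame: pd.DataFrame, reference_year: int = 2025) -> pd.DataFrame:
    """Return a copy of `frame` with the four engineered columns added."""
    out = frame.copy()
    # Treat a zero area as missing so the ratios become NaN instead of inf
    area = out["SquareFeet"].replace(0, np.nan)
    out["House_Age"] = reference_year - out["YearBuilt"]
    out["Total_Rooms"] = out["Bedrooms"] + out["Bathrooms"]
    out["Room_per_SqFt"] = out["Total_Rooms"] / area
    # Derived from the target: fine for analysis, but leaks Price if used as a model input
    out["Price_per_SqFt"] = out["Price"] / area
    return out


# Toy rows (illustrative values only); the second row has a zero area on purpose
toy = pd.DataFrame({
    "SquareFeet": [2000, 0],
    "Bedrooms": [3, 2],
    "Bathrooms": [2, 1],
    "YearBuilt": [2000, 1990],
    "Price": [300000, 150000],
})

engineered = add_ratio_features(toy)
print(engineered[["House_Age", "Total_Rooms", "Room_per_SqFt", "Price_per_SqFt"]])
```

Returning a copy keeps the input frame unchanged, which makes the step easy to re-run in a notebook.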

### 3. Feature Transformation

In [None]:
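# log1p computes log(1 + x), which stays defined when x is 0
df['SquareFeet_log'] = np.log1p(df['SquareFeet'])

# Plot the distribution before and after the transformation
fig, axes = plt.subplots(1, 2, figsize=(12, 4))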

axes[0].hist(df['SquareFeet'], bins=30, color='skyblue', edgecolor='black')
axes[0].set_title('SquareFeet')

axes[1].hist(df['SquareFeet_log'], bins=30, color='lightgreen', edgecolor='black')
axes[1].set_title('SquareFeet_log')

plt.tight_layout()
plt.show()
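
The histograms give a visual impression; whether a log transform actually reduces skewness can also be checked numerically with `Series.skew()`. The sketch below uses synthetic right-skewed values as a stand-in, so the numbers say nothing about the real SquareFeet column. On the notebook's data, the comparison would be between `df['SquareFeet'].skew()` and `df['SquareFeet_log'].skew()`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values (assumption: a lognormal stand-in, not the real data)
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=7.5, sigma=0.6, size=5000))

skew_raw = values.skew()            # sample skewness of the raw values
skew_log = np.log1p(values).skew()  # sample skewness after log1p

print(f"Skewness before log1p: {skew_raw:.2f}")
print(f"Skewness after log1p:  {skew_log:.2f}")
```

A skewness close to 0 indicates a roughly symmetric distribution. If the raw column is already near 0, the transformation may not be needed.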

### 4. Feature Selection

In [None]:
# Separate candidate features from the target; YearBuilt is replaced by House_Age
X = df.drop(columns=['Price', 'YearBuilt'])
y = df['Price']

# Pearson correlation of each numeric feature with the target
corr_with_price = X.select_dtypes(include='number').corrwith(y).sort_values(ascending=False)

print("Top 10 features correlated with Price:")
print(corr_with_price.head(10))

# Keep features whose absolute correlation with Price exceeds 0.05
selected_features = corr_with_price[corr_with_price.abs() > 0.05].index.tolist()

print(f"\nSelected features (|correlation| > 0.05): {len(selected_features)}")
print(selected_features)
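
Filtering by correlation with the target does not remove features that are redundant with each other; House_Age, for instance, is a linear function of YearBuilt. The sketch below shows one possible way to list such pairs. The helper `highly_correlated_pairs`, its 0.95 threshold, and the toy values are assumptions made for illustration.

```python
import pandas as pd


def highly_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.95) -> list:
    """List (col_a, col_b, |r|) for numeric column pairs with |r| above `threshold`."""
    corr = frame.select_dtypes(include="number").corr().abs()
    cols = corr.columns
    pairs = []
    # Visit each unordered pair once (upper triangle of the correlation matrix)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs


# Toy data (illustrative values only)
toy = pd.DataFrame({
    "YearBuilt": [1980, 1995, 2005, 2015, 2020],
    "SquareFeet": [2100, 1200, 1900, 1500, 2500],
})
toy["House_Age"] = 2025 - toy["YearBuilt"]

pairs = highly_correlated_pairs(toy)
for col_a, col_b, r in pairs:
    print(f"{col_a} vs {col_b}: |r| = {r:.2f}")
```

For each reported pair, keeping only one of the two columns is usually enough, since the second adds no new linear information.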

### 5. Final Feature Set

In [None]:
final_features = selected_features.copy()

df_final = df[final_features + ['Price']].copy()

print(f"Final dataset shape: {df_final.shape}")
print(f"Features: {len(final_features)}")
print(f"\nFeature list:")
for i, col in enumerate(final_features, 1):
    print(f"{i}. {col}")

df_final.to_csv('../data/processed/engineered_features.csv', index=False)

### Conclusion

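The feature engineering process created four new features (House_Age, Price_per_SqFt, Total_Rooms, Room_per_SqFt) and added SquareFeet_log, a `log1p` transformation of SquareFeet intended to reduce skewness. Correlation analysis then kept the features whose absolute correlation with the target Price exceeds 0.05. The resulting dataset is saved to `../data/processed/engineered_features.csv` for model training. Because Price_per_SqFt is computed from Price, it should be excluded from the model inputs to avoid target leakage.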