# Feature Engineering - House Price Dataset
## Introduction
This notebook focuses on **feature engineering** for the house price prediction project. Using the cleaned dataset from `02_data_preprocessing.ipynb`, we will create new meaningful features that can help improve model performance.

**Dataset:** Housing Price Prediction Data (Kaggle)

**Objective:** Create and select new features to improve model performance and prepare the dataset for modeling.

**Author:** NGUYEN Ngoc Dang Nguyen - Final-year Student in Computer Science, Aix-Marseille University

**Feature engineering steps:**
1. Load cleaned data and understand feature relationships
2. Create domain-specific features (housing market insights)
3. Generate interaction features between important variables
4. Create polynomial and mathematical transformations
5. Implement binning and categorical feature creation
6. Feature selection and importance analysis
7. Save engineered dataset for modeling

## 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_regression
from scipy.stats import skew
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

# Load the cleaned dataset
df = pd.read_csv("../data/processed/cleaned_data.csv")

print(f"Cleaned dataset loaded: {df.shape[0]} rows and {df.shape[1]} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

print(f"\nSample of the cleaned data:")
df.head()

## 2. Understanding Current Features

In [None]:
print("\nCURRENT FEATURE ANALYSIS")
print("=" * 40)

# Identify different types of features
numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
encoded_features = [col for col in df.columns if col.endswith(('_Rural', '_Suburb', '_Urban'))]
scaled_features = [col for col in df.columns if col.endswith('_scaled')]

print(f"Feature inventory:")
print(f"    * Total features: {len(df.columns)}")
print(f"    * Numeric features: {len(numeric_features)}")
print(f"    * Categorical features: {len(categorical_features)}")
print(f"    * Encoded features: {len(encoded_features)}")
print(f"    * Scaled features: {len(scaled_features)}")

# Find the target variable (price)
price_cols = [col for col in numeric_features if 'price' in col.lower() and not col.endswith('_scaled')]
target_col = price_cols[0] if price_cols else numeric_features[0]

print(f"\nTarget variable identified: {target_col}")

# Show the correlation with target
correlations = df[numeric_features].corr()[target_col].abs().sort_values(ascending=False)
print(f"\nTop 5 features correlated with {target_col}:")
for feature, corr in correlations.drop(target_col).head(5).items():
    print(f"    * {feature}: {corr:.3f}")

# Basic feature statistics
print(f"\nFeature statistics:")
print(df[numeric_features].describe().round(2))

## 3. Domain-Specific Feature Creation

In [None]:
print("\nDOMAIN-SPECIFIC FEATURE ENGINEERING")
print("=" * 40)

# Keep track of new features created
new_features = []

# 3.1 Living Space Features
print("Creating living space features...")

# Total living area features
living_cols = [col for col in df.columns if 'sqft' in col.lower() or 'area' in col.lower()]
if len(living_cols) >= 2:
    # Assume we have sqft_living and sqft_lot or similar
    for i, col1 in enumerate(living_cols):
        for col2 in living_cols[i+1:]:
            if col1 != col2 and not col1.endswith('_scaled') and not col2.endswith('_scaled'):
                # Create ratio feature
                ratio_name = f"{col1}_to_{col2}_ratio"
                df[ratio_name] = df[col1] / (df[col2] + 1)  # Add 1 to avoid division by zero
                new_features.append(ratio_name)
                print(f"  Created {ratio_name}")

# Room density features
room_cols = [col for col in df.columns if 'room' in col.lower() or 'bedroom' in col.lower() or 'bathroom' in col.lower()]
area_cols = [col for col in df.columns if 'sqft' in col.lower() and 'living' in col.lower()]

if room_cols and area_cols:
    for room_col in room_cols:
        for area_col in area_cols:
            if not room_col.endswith('_scaled') and not area_col.endswith('_scaled'):
                density_name = f"{room_col}_per_sqft"
                df[density_name] = df[room_col] / (df[area_col] + 1)
                new_features.append(density_name)
                print(f"  Created {density_name}")

# 3.2 Age and Time Features
print(f"\nCreating age and time features...")

# House age features
year_cols = [col for col in df.columns if 'year' in col.lower() or 'built' in col.lower()]
for col in year_cols:
    if not col.endswith('_scaled') and pd.api.types.is_numeric_dtype(df[col]):
        # Current age
        age_col = f"{col}_age"
        df[age_col] = 2024 - df[col]
        new_features.append(age_col)
        print(f"  Created {age_col}")
        
        # Age categories
        age_cat_col = f"{col}_category"
        df[age_cat_col] = pd.cut(df[age_col],
                                bins=[0, 10, 25, 50, 100, float('inf')],
                                labels=['New', 'Recent', 'Mature', 'Old', 'Historic'],
                                include_lowest=True)  # keep houses with age 0 in 'New'
        new_features.append(age_cat_col)
        print(f"  Created {age_cat_col}")

# 3.3 Location and Neighborhood Features
print(f"\nCreating location-based features...")

# ZIP code or location features
location_cols = [col for col in df.columns if any(word in col.lower() for word in ['zip', 'location', 'city', 'neighborhood'])]
for col in location_cols:
    if df[col].dtype == 'object' or col.endswith('_encoded'):
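        # Average price by location (a form of target encoding).
        # Note: this uses the target, so for modeling it should be fit on training rows only.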
        location_avg_col = f"{col}_avg_price"
        location_avg = df.groupby(col)[target_col].mean()
        df[location_avg_col] = df[col].map(location_avg)
        new_features.append(location_avg_col)
        print(f"  Created {location_avg_col}")

# 3.4 Price-based Features
print(f"\nCreating price-related features...")

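# Price per unit features
# Note: these are derived from the target, so they are useful for analysis
# but leak the target if they are used as model inputs.
size_cols = [col for col in df.columns if any(word in col.lower() for word in ['sqft', 'area', 'size'])]
for col in size_cols:
    if not col.endswith('_scaled') and pd.api.types.is_numeric_dtype(df[col]):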
        price_per_unit = f"price_per_{col}"
        df[price_per_unit] = df[target_col] / (df[col] + 1)
        new_features.append(price_per_unit)
        print(f"  Created {price_per_unit}")

print(f"\nDomain features created: {len([f for f in new_features if f in df.columns])}")
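
The location average and price-per-unit features above are computed from the target column, so they can leak price information into the model if they are built on the full dataset. A minimal sketch of one way to limit this, assuming a train/test split already exists: learn the per-location mean on the training rows only and map it onto the test rows. The column names `Neighborhood` and `Price`, the helper `add_location_mean`, and the toy values are hypothetical and only illustrate the idea.

```python
import pandas as pd

def add_location_mean(train, test, location_col, target_col):
    """Map each location to its mean target, learned from the training rows only.

    Hypothetical helper for illustration; not part of the notebook's pipeline.
    """
    train = train.copy()
    test = test.copy()
    location_mean = train.groupby(location_col)[target_col].mean()
    global_mean = train[target_col].mean()
    new_col = f"{location_col}_avg_price"
    train[new_col] = train[location_col].map(location_mean)
    # Locations unseen in training fall back to the overall training mean
    test[new_col] = test[location_col].map(location_mean).fillna(global_mean)
    return train, test

# Toy data (made-up values)
train_demo = pd.DataFrame({
    "Neighborhood": ["Urban", "Urban", "Rural", "Rural"],
    "Price": [300.0, 500.0, 100.0, 200.0],
})
test_demo = pd.DataFrame({"Neighborhood": ["Urban", "Suburb"]})

train_enc, test_enc = add_location_mean(train_demo, test_demo, "Neighborhood", "Price")
print(test_enc)
```

Training rows still see their own price in this sketch; an out-of-fold variant of the same idea would be stricter.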

## 4. Interaction Features

In [None]:
print("\nINTERACTION FEATURE CREATION")
print("=" * 40)

# Find top correlated features for interactions
base_numeric = [col for col in numeric_features if not col.endswith('_scaled') and col != target_col]
# Filter to unscaled base features first, then keep the five most correlated
top_features = [f for f in correlations.drop(target_col).index if f in base_numeric][:5]

print(f"Creating interactions between top {len(top_features)} features: {top_features}")

interaction_features = []

# Create pairwise interactions
for i, feat1 in enumerate(top_features):
    for feat2 in top_features[i+1:]:
        if feat1 != feat2:
            # Multiplication interaction
            mult_name = f"{feat1}_x_{feat2}"
            df[mult_name] = df[feat1] * df[feat2]
            interaction_features.append(mult_name)
            
            # Addition interaction
            add_name = f"{feat1}_plus_{feat2}"
            df[add_name] = df[feat1] + df[feat2]
            interaction_features.append(add_name)
            
            # Ratio interaction (only if all denominator values are positive)
            if df[feat2].min() > 0:
                ratio_name = f"{feat1}_div_{feat2}"
                df[ratio_name] = df[feat1] / df[feat2]
                interaction_features.append(ratio_name)

print(f"Created {len(interaction_features)} interaction features")

# Show sample interactions
if interaction_features:
    print(f"\nSample interactions:")
    for feat in interaction_features[:5]:
        print(f"  • {feat}")

## 5. Mathematical Transformations

In [None]:
print("\nMATHEMATICAL TRANSFORMATIONS")
print("=" * 40)

transformation_features = []

# Polynomial features for top predictors
print("Creating polynomial features...")

poly_candidates = [col for col in top_features if df[col].min() >= 0]

for col in poly_candidates[:3]:
    # Square
    square_name = f"{col}_squared"
    df[square_name] = df[col] ** 2
    transformation_features.append(square_name)

    # Square root (if all values are non-negative)
    if df[col].min() >= 0:
        sqrt_name = f"{col}_sqrt"
        df[sqrt_name] = np.sqrt(df[col])
        transformation_features.append(sqrt_name)
    
    # Log transformation (if all values are positive)
    if df[col].min() > 0:
        log_name = f"{col}_log"
        df[log_name] = np.log(df[col])
        transformation_features.append(log_name)
    
print(f"Created {len(transformation_features)} mathematical transformations")

# Binning continuous variables
print(f"\nCreating binned categorical features...")

binning_features = []

# Bin important continuous features
for col in top_features[:3]:
    if pd.api.types.is_numeric_dtype(df[col]) and df[col].nunique() > 10:
        # Create quantile-based bins
        binned_name = f"{col}_binned"
        df[binned_name] = pd.qcut(df[col], q=5, labels=['Low', 'Low-Med', 'Medium', 'Med-High', 'High'])
        binning_features.append(binned_name)
        print(f"  Created {binned_name}")

print(f"Created {len(binning_features)} binned features")
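
One way to judge whether a transformation is worthwhile is to compare skewness before and after it, using the `skew` function imported in section 1. The sketch below uses synthetic right-skewed data (a hypothetical `LotSize` series, not a column from the dataset) and `np.log1p`, which, unlike `np.log`, is defined at zero.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Synthetic right-skewed values standing in for a real feature
rng = np.random.default_rng(0)
demo = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000), name="LotSize")

raw_skew = skew(demo)
sqrt_skew = skew(np.sqrt(demo))
log_skew = skew(np.log1p(demo))  # log(1 + x), safe when x == 0

print(f"raw:   {raw_skew:.2f}")
print(f"sqrt:  {sqrt_skew:.2f}")
print(f"log1p: {log_skew:.2f}")
```

A transformation that moves the skewness closer to zero is a reasonable candidate to keep; one that barely changes it mostly adds a redundant column.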
   

## 6. Feature Selection and Analysis

In [None]:
print("\nFEATURE SELECTION AND ANALYSIS")
print("=" * 40)

# Get all features for analysis
all_numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
if target_col in all_numeric_features:
    all_numeric_features.remove(target_col)

# Remove scaled features for now (we'll add them back later)
analysis_features = [col for col in all_numeric_features if not col.endswith('_scaled')]

print(f"Analyzing {len(analysis_features)} features for selection")

# Calculate correlations with target
feature_correlations = df[analysis_features + [target_col]].corr()[target_col].abs().sort_values(ascending=False)
feature_correlations = feature_correlations.drop(target_col)

print(f"\nTop 10 most correlated features with {target_col}:")
for i, (feature, corr) in enumerate(feature_correlations.head(10).items(), 1):
    feature_type = "New" if feature in new_features + interaction_features + transformation_features + binning_features else "Original"
    print(f"{i:2d}. {feature:<30} | {corr:.3f} | {feature_type}")

# Visualize top feature correlations
plt.figure(figsize=(12, 8))
top_features_viz = feature_correlations.head(15)
plt.barh(range(len(top_features_viz)), top_features_viz.values)
plt.yticks(range(len(top_features_viz)), top_features_viz.index)
plt.xlabel('Correlation with Price (absolute value)')
plt.title('Top 15 Features by Correlation with Price')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

# Statistical feature selection using SelectKBest
print(f"\nStatistical feature selection (SelectKBest):")
k_best = min(20, len(analysis_features))  # k cannot exceed the number of available features
selector = SelectKBest(score_func=f_regression, k=k_best)  # Select up to 20 features
X_selected = selector.fit_transform(df[analysis_features], df[target_col])
# get_support() returns columns in their original order, so sort by F-score to rank them
selected_idx = sorted(selector.get_support(indices=True), key=lambda i: selector.scores_[i], reverse=True)
selected_features = [analysis_features[i] for i in selected_idx]

print(f"SelectKBest selected {len(selected_features)} features:")
for i, feature in enumerate(selected_features[:10], 1):  # Show top 10 by F-score
    feature_type = "New" if feature in new_features + interaction_features + transformation_features + binning_features else "Original"
    print(f"{i:2d}. {feature:<30} | {feature_type}")
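
Univariate selection scores each feature on its own, so near-duplicates such as a column and its square root can all be selected together. A possible follow-up step, sketched here under the assumption that a pairwise correlation threshold of 0.95 is acceptable, drops the later column of each highly correlated pair. The helper `drop_redundant`, the threshold, and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_redundant(frame, threshold=0.95):
    """Drop the later column of each pair whose absolute correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

# Toy data: a feature, its square root, and an unrelated feature
rng = np.random.default_rng(1)
area = rng.uniform(500, 3000, size=100)
demo = pd.DataFrame({
    "SquareFeet": area,
    "SquareFeet_sqrt": np.sqrt(area),
    "Bedrooms": rng.integers(1, 6, size=100),
})

reduced, dropped = drop_redundant(demo)
print("Dropped:", dropped)
print("Kept:", reduced.columns.tolist())
```

Which member of a pair to keep is a judgment call; this sketch simply keeps the column that appears first.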

## 7. Final Feature Set Creation

In [None]:
print("\nFINAL FEATURE SET CREATION")
print("=" * 40)

# Combine best features for modeling
final_features = []

# 1. Top correlated features
final_features.extend(feature_correlations.head(15).index.tolist())

# 2. Top statistically selected features
final_features.extend(selected_features[:10])

# 3. Remove duplicates while preserving order, so the feature list is reproducible across runs
final_features = list(dict.fromkeys(final_features))

# Add scaled versions of numerical features for ML algorithms
scaled_versions = []
for feature in final_features:
    scaled_name = f"{feature}_scaled"
    if scaled_name in df.columns:
        scaled_versions.append(scaled_name)

# Create final modeling dataset
modeling_features = final_features + scaled_versions + [target_col]
modeling_df = df[modeling_features].copy()

print(f"Final feature set summary:")
print(f"  * Original features: {len([f for f in final_features if f not in new_features + interaction_features + transformation_features])}")
print(f"  * Engineered features: {len([f for f in final_features if f in new_features + interaction_features + transformation_features])}")
print(f"  * Scaled versions: {len(scaled_versions)}")
print(f"  * Total features for modeling: {len(modeling_features) - 1}")  # Subtract target

# Check for any remaining missing values
missing_in_final = modeling_df.isnull().sum()
if missing_in_final.sum() > 0:
    print(f"\nMissing values in final dataset:")
    print(missing_in_final[missing_in_final > 0])
else:
    print(f"\nNo missing values in final dataset")

## 8. Feature Engineering Validation

In [None]:
print("\nFEATURE ENGINEERING VALIDATION")
print("=" * 40)

# Correlation heatmap of top features
plt.figure(figsize=(12, 10))
top_features_for_viz = feature_correlations.head(12).index.tolist() + [target_col]
correlation_matrix = df[top_features_for_viz].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation Matrix - Top Features (Original and Engineered)')
plt.tight_layout()
plt.show()

# Feature distribution plots
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

top_new_features = [f for f in feature_correlations.head(10).index if f in new_features + interaction_features + transformation_features][:6]

for i, feature in enumerate(top_new_features):
    if i < 6:
        df[feature].hist(bins=30, ax=axes[i], alpha=0.7)
        axes[i].set_title(f'Distribution of {feature}')
        axes[i].set_xlabel(feature)
        axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()
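
The plots above describe the engineered features but do not show whether they help a model. A rough sketch of such a check, using synthetic data and a plain `LinearRegression` (both are assumptions for illustration, not part of this project's pipeline), compares cross-validated R² with and without one interaction column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with hypothetical column names
rng = np.random.default_rng(7)
n = 300
demo = pd.DataFrame({
    "SquareFeet": rng.uniform(800, 3000, size=n),
    "Bedrooms": rng.integers(1, 6, size=n).astype(float),
})
# The synthetic target depends on an interaction, so the engineered column should help
price = 50 * demo["SquareFeet"] * demo["Bedrooms"] + rng.normal(scale=5000, size=n)

engineered = demo.assign(SquareFeet_x_Bedrooms=demo["SquareFeet"] * demo["Bedrooms"])

# Mean R² over 5 folds, with and without the engineered column
baseline_r2 = cross_val_score(LinearRegression(), demo, price, cv=5, scoring="r2").mean()
engineered_r2 = cross_val_score(LinearRegression(), engineered, price, cv=5, scoring="r2").mean()

print(f"Baseline R²:   {baseline_r2:.3f}")
print(f"Engineered R²: {engineered_r2:.3f}")
```

On real data the gain may be much smaller or absent, and any feature derived from the target has to be excluded before running a comparison like this.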

## 9. Save Engineered Dataset

In [None]:
print("\nSAVING ENGINEERED DATASET")
print("=" * 40)

# Save the complete engineered dataset
engineered_path = '../data/processed/engineered_data.csv'
df.to_csv(engineered_path, index=False)
print(f"Complete engineered dataset saved: {engineered_path}")
print(f"Size: {df.shape[0]} rows x {df.shape[1]} columns")

# Save the modeling-ready dataset
modeling_path = '../data/processed/modeling_data.csv'
modeling_df.to_csv(modeling_path, index=False)
print(f"Modeling dataset saved: {modeling_path}")
print(f"Size: {modeling_df.shape[0]} rows x {modeling_df.shape[1]} columns")

# Save feature information
feature_info = {
    'target_column': target_col,
    'total_features': len(df.columns) - 1,  # Exclude target
    'original_features': len([f for f in df.columns if f != target_col and f not in new_features + interaction_features + transformation_features + binning_features]),
    'domain_features': len(new_features),
    'interaction_features': len(interaction_features),
    'transformation_features': len(transformation_features),
    'binning_features': len(binning_features),
    'final_modeling_features': len(modeling_features) - 1,
    'top_features': feature_correlations.head(10).to_dict(),
    'selected_features_for_modeling': [f for f in modeling_features if f != target_col]
}

# Save feature info
import json
with open('../data/processed/feature_engineering_summary.json', 'w') as f:
    json.dump(feature_info, f, indent=2, default=str)

print(f"Feature engineering summary saved: '../data/processed/feature_engineering_summary.json'")
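
CSV files do not store pandas dtypes, so the ordered categorical columns produced by `pd.cut` and `pd.qcut` come back as plain text when the file is reloaded. A minimal sketch of restoring the order after loading, assuming a hypothetical `SquareFeet_binned` column with three labels and using an in-memory buffer instead of a file on disk:

```python
import io
import pandas as pd

labels = ["Low", "Medium", "High"]

# Toy data with a quantile-binned column (made-up values)
demo = pd.DataFrame({"SquareFeet": [900, 1500, 2100, 2600, 1200, 3000]})
demo["SquareFeet_binned"] = pd.qcut(demo["SquareFeet"], q=3, labels=labels)

# Round trip through CSV, as the save step does with a file
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
dtype_after_reload = reloaded["SquareFeet_binned"].dtype  # no longer categorical

# Restore the ordered categorical dtype explicitly
ordered_type = pd.CategoricalDtype(categories=labels, ordered=True)
reloaded["SquareFeet_binned"] = reloaded["SquareFeet_binned"].astype(ordered_type)

print("After reload:", dtype_after_reload)
print("Restored:", reloaded["SquareFeet_binned"].dtype)
```

Storing the label order alongside the data, for example in the JSON summary, would avoid hard-coding it in the next notebook.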

## Feature Engineering Summary & Next Steps
Through domain-driven feature creation, interaction terms, mathematical transformations, and correlation-based selection, the dataset now contains a set of additional candidate predictors. The complete engineered dataset and the modeling-ready subset are saved under `../data/processed/`, ready for training and evaluating machine learning models. Whether the new features actually improve predictive performance still has to be confirmed during modeling.