# 02 - Feature Engineering
## Real Estate Price Prediction

**Author:** Nicolas  
**Date:** 2025-01-09  
**Objective:** Create and transform features to improve the performance of the ML models

---

### Table of Contents
1. [Data Loading](#1-data-loading)
2. [Outlier Treatment](#2-outlier-treatment)
3. [Feature Creation](#3-feature-creation)
4. [Encoding Categorical Variables](#4-encoding-categorical-variables)
5. [Feature Scaling](#5-feature-scaling)
6. [Target Transformation](#6-target-transformation)
7. [Feature Selection](#7-feature-selection)
8. [Train/Test Split](#8-train-test-split)
9. [Export Processed Data](#9-export-processed-data)

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler, LabelEncoder, OneHotEncoder
from scipy import stats
import warnings
import joblib

warnings.filterwarnings('ignore')

# Configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)

%matplotlib inline

## 1. Data Loading

In [None]:
# Load the data
df = pd.read_csv('../data/real_estate_data.csv')

print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()

## 2. Outlier Treatment

In [None]:
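# Helper to treat outliers
def cap_outliers_iqr(df, column, factor=1.5):
    """
    Cap outliers with the IQR method.

    Values outside [Q1 - factor * IQR, Q3 + factor * IQR] are clipped to the
    nearest bound. The column of `df` is modified in place.
    """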
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - factor * IQR
    upper_bound = Q3 + factor * IQR
    
    # Cap values
    df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)
    
    return df, lower_bound, upper_bound

# Cap outliers for the key variables
print("=" * 80)
print("OUTLIER TREATMENT")
print("=" * 80)

# Note: 'price' is the target, so capping it also changes the values the model is trained and evaluated on
outlier_cols = ['price', 'surface_m2', 'age_years']

for col in outlier_cols:
    before = df[col].describe()
    df, lower, upper = cap_outliers_iqr(df, col)
    after = df[col].describe()
    
    print(f"\n{col.upper()}:")
    print(f"  Bounds: [{lower:.2f}, {upper:.2f}]")
    print(f"  Before: min={before['min']:.2f}, max={before['max']:.2f}")
    print(f"  After: min={after['min']:.2f}, max={after['max']:.2f}")
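
The capping above computes the IQR bounds on the full dataset. The following is a minimal sketch, assuming the train/test split is done first, of learning the bounds on training rows only and reusing them on test rows. `fit_iqr_bounds` and `apply_bounds` are hypothetical helpers that are not part of this notebook, and the toy values are illustrative.

```python
import pandas as pd

def fit_iqr_bounds(train: pd.DataFrame, columns, factor: float = 1.5) -> dict:
    """Compute (lower, upper) IQR bounds per column from training rows only."""
    bounds = {}
    for col in columns:
        q1 = train[col].quantile(0.25)
        q3 = train[col].quantile(0.75)
        iqr = q3 - q1
        bounds[col] = (q1 - factor * iqr, q3 + factor * iqr)
    return bounds

def apply_bounds(data: pd.DataFrame, bounds: dict) -> pd.DataFrame:
    """Return a copy of `data` with each column clipped to its stored bounds."""
    out = data.copy()
    for col, (lower, upper) in bounds.items():
        out[col] = out[col].clip(lower=lower, upper=upper)
    return out

# Toy data (illustrative values only)
train_demo = pd.DataFrame({'surface_m2': [30.0, 40.0, 50.0, 60.0, 70.0]})
test_demo = pd.DataFrame({'surface_m2': [45.0, 500.0]})

bounds_demo = fit_iqr_bounds(train_demo, ['surface_m2'])
test_capped = apply_bounds(test_demo, bounds_demo)
print(bounds_demo)
print(test_capped)
```

Because `apply_bounds` returns a copy, the original frame stays available for comparison, and the same `bounds_demo` dictionary could be saved alongside the scaler for use at prediction time.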

## 3. Feature Creation

In [None]:
print("=" * 80)
print("FEATURE CREATION")
print("=" * 80)

# 1. Price per m² (derived from the target: exploration only, dropped before modeling)
df['price_per_m2'] = df['price'] / df['surface_m2']

# 2. Surface per room
df['surface_per_room'] = df['surface_m2'] / df['rooms']

# 3. Age bins (categories); the open-ended last bin keeps ages above 100 from becoming NaN
df['age_category'] = pd.cut(df['age_years'],
                            bins=[-1, 5, 15, 30, np.inf],
                            labels=['Neuf', 'Récent', 'Moyen', 'Ancien'])

# 4. Surface bins; the open-ended last bin keeps surfaces above 1000 m² from becoming NaN
df['surface_category'] = pd.cut(df['surface_m2'],
                                bins=[0, 40, 70, 100, np.inf],
                                labels=['Studio', 'Moyen', 'Grand', 'Très_grand'])

# 5. Comfort score (sum of the three amenity flags, 0 to 3)
df['comfort_score'] = (df['has_elevator'].astype(int) + 
                       df['has_parking'].astype(int) + 
                       df['has_balcony'].astype(int))

# 6. Floor category
df['floor_category'] = df['floor'].apply(lambda x: 'RDC' if x == 0 else 
                                         ('Bas' if x <= 2 else 
                                         ('Moyen' if x <= 5 else 'Haut')))

# 7. Rooms/surface ratio (density)
df['room_density'] = df['rooms'] / df['surface_m2']

# 8. Log transform of the surface (to linearize the relationship)
df['log_surface'] = np.log1p(df['surface_m2'])

# 9. City × surface interaction
df['city_surface_interaction'] = df['city'] + '_' + df['surface_category'].astype(str)

# 10. Polynomial feature for surface (captures non-linearity)
df['surface_squared'] = df['surface_m2'] ** 2

new_cols = ['price_per_m2', 'surface_per_room', 'age_category', 'surface_category', 
            'comfort_score', 'floor_category', 'room_density', 'log_surface',
            'city_surface_interaction', 'surface_squared']

print(f"\n✅ {len(new_cols)} new features created")
print(f"Dataset shape: {df.shape}")
print("\nNew columns:")
for col in new_cols:
    print(f"  - {col}")

In [None]:
# Visualize the impact of the new features
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Price by age category
df.boxplot(column='price', by='age_category', ax=axes[0, 0])
axes[0, 0].set_title('Price by Age Category', fontsize=11, fontweight='bold')
axes[0, 0].set_xlabel('Age Category')
axes[0, 0].set_ylabel('Price (€)')

# Price by surface category
df.boxplot(column='price', by='surface_category', ax=axes[0, 1])
axes[0, 1].set_title('Price by Surface Category', fontsize=11, fontweight='bold')
axes[0, 1].set_xlabel('Surface Category')
axes[0, 1].set_ylabel('Price (€)')

# Price by comfort score
df.boxplot(column='price', by='comfort_score', ax=axes[1, 0])
axes[1, 0].set_title('Price by Comfort Score', fontsize=11, fontweight='bold')
axes[1, 0].set_xlabel('Comfort Score (0-3)')
axes[1, 0].set_ylabel('Price (€)')

# Log(surface) vs price
axes[1, 1].scatter(df['log_surface'], df['price'], alpha=0.5, s=10)
axes[1, 1].set_title('Price vs Log(Surface)', fontsize=11, fontweight='bold')
axes[1, 1].set_xlabel('Log(Surface)')
axes[1, 1].set_ylabel('Price (€)')

plt.tight_layout()
plt.show()

## 4. Encoding Categorical Variables

In [None]:
print("=" * 80)
print("ENCODING CATEGORICAL VARIABLES")
print("=" * 80)

# Copy for encoding
df_encoded = df.copy()

# 1. One-Hot Encoding for 'city' (medium cardinality)
city_dummies = pd.get_dummies(df_encoded['city'], prefix='city', drop_first=True)
df_encoded = pd.concat([df_encoded, city_dummies], axis=1)
print(f"\n✅ One-Hot Encoding: city → {len(city_dummies.columns)} new columns")

# 2. Ordinal Encoding for 'energy_class' (natural order: A > B > ... > G)
# Classes missing from the mapping become NaN
energy_mapping = {'A': 7, 'B': 6, 'C': 5, 'D': 4, 'E': 3, 'F': 2, 'G': 1}
df_encoded['energy_class_encoded'] = df_encoded['energy_class'].map(energy_mapping)
print("✅ Ordinal Encoding: energy_class → energy_class_encoded")

# 3. One-Hot Encoding for the new categories
for col in ['age_category', 'surface_category', 'floor_category']:
    dummies = pd.get_dummies(df_encoded[col], prefix=col, drop_first=True)
    df_encoded = pd.concat([df_encoded, dummies], axis=1)
    print(f"✅ One-Hot Encoding: {col} → {len(dummies.columns)} new columns")

# 4. Target Encoding for 'city_surface_interaction' (high cardinality)
# Mean price per group. Caution: the means are computed on the full dataset,
# before the train/test split, so test-set prices leak into this feature.
target_encoding = df_encoded.groupby('city_surface_interaction')['price'].mean().to_dict()
df_encoded['city_surface_encoded'] = df_encoded['city_surface_interaction'].map(target_encoding)
print("✅ Target Encoding: city_surface_interaction → city_surface_encoded")

print(f"\nShape after encoding: {df_encoded.shape}")
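
The target encoding above uses group means computed over every row. One common alternative, sketched here under the assumption that a train/test split already exists, learns the group means on training rows only, shrinks small groups toward the global mean, and falls back to the global mean for groups unseen in training. `fit_target_encoding`, `apply_target_encoding`, the `smoothing` parameter and the toy data are all hypothetical and not taken from this notebook.

```python
import pandas as pd

def fit_target_encoding(train: pd.DataFrame, key: str, target: str,
                        smoothing: float = 10.0):
    """Learn smoothed per-group target means from training rows only."""
    global_mean = train[target].mean()
    stats = train.groupby(key)[target].agg(['mean', 'count'])
    # Shrink each group mean toward the global mean; small groups shrink more
    smoothed = ((stats['count'] * stats['mean'] + smoothing * global_mean)
                / (stats['count'] + smoothing))
    return smoothed.to_dict(), global_mean

def apply_target_encoding(data: pd.DataFrame, key: str,
                          mapping: dict, fallback: float) -> pd.Series:
    """Map groups to their learned means; unseen groups get the fallback."""
    return data[key].map(mapping).fillna(fallback)

# Toy data (illustrative values only)
train_demo = pd.DataFrame({'group': ['A', 'A', 'B'],
                           'price': [100.0, 200.0, 300.0]})
test_demo = pd.DataFrame({'group': ['A', 'C']})

mapping_demo, global_mean_demo = fit_target_encoding(
    train_demo, 'group', 'price', smoothing=1.0)
encoded_demo = apply_target_encoding(
    test_demo, 'group', mapping_demo, global_mean_demo)
print(encoded_demo)
```

With this layout the test rows never contribute to the encoding, which keeps the evaluation on the test set honest.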

In [None]:
# Select the final features for the model
# Drop the original categorical columns, plus price_per_m2 (derived from the target)
cols_to_drop = ['city', 'energy_class', 'age_category', 'surface_category', 
                'floor_category', 'city_surface_interaction', 'price_per_m2']

df_model = df_encoded.drop(columns=cols_to_drop, errors='ignore')

print("\n" + "=" * 80)
print("FINAL FEATURES FOR THE MODEL")
print("=" * 80)
print(f"\nTotal number of features: {df_model.shape[1] - 1}")
print("\nFeature list:")
feature_cols = [col for col in df_model.columns if col != 'price']
for idx, col in enumerate(feature_cols, 1):
    print(f"  {idx}. {col}")

## 5. Feature Scaling

In [None]:
print("=" * 80)
print("FEATURE SCALING")
print("=" * 80)

# Split into X and y
X = df_model.drop('price', axis=1)
y = df_model['price']

# Identify the numeric columns to scale
# (excluding one-hot encoded columns, which are already 0/1)
numeric_cols = ['surface_m2', 'rooms', 'age_years', 'floor', 'comfort_score',
                'surface_per_room', 'room_density', 'log_surface', 'surface_squared',
                'city_surface_encoded', 'energy_class_encoded']

# Use RobustScaler (robust to outliers)
scaler = RobustScaler()
X_scaled = X.copy()

# Scale only the numeric columns that exist
# Caution: the scaler is fitted on all rows, before the train/test split
cols_to_scale = [col for col in numeric_cols if col in X.columns]
X_scaled[cols_to_scale] = scaler.fit_transform(X[cols_to_scale])

print(f"\n✅ {len(cols_to_scale)} features scaled with RobustScaler")
print("\nScaled features:")
for col in cols_to_scale:
    print(f"  - {col}")

# Save the scaler for production
joblib.dump(scaler, '../models/scaler.pkl')
print("\n✅ Scaler saved: ../models/scaler.pkl")
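
The scaler above is fitted on all rows before the split. A minimal sketch of the alternative order, assuming the split comes first: fit `RobustScaler` on the training rows and only transform the test rows. The synthetic data and the `*_demo`, `X_tr`, `X_te` names are illustrative and not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic features (illustrative values only)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    'surface_m2': rng.uniform(20, 150, size=100),
    'rooms': rng.integers(1, 6, size=100),
})

# Split first, so the test rows play no part in fitting the scaler
X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler_demo = RobustScaler()
X_tr_scaled = pd.DataFrame(scaler_demo.fit_transform(X_tr),
                           columns=X_tr.columns, index=X_tr.index)
# Reuse the medians and IQRs learned on the training rows
X_te_scaled = pd.DataFrame(scaler_demo.transform(X_te),
                           columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.median())
```

After fitting, the training medians are mapped to zero, while the test medians are generally close to zero but not exactly zero, which is the expected behavior.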

## 6. Target Transformation

In [None]:
# Test the log transformation on the target
print("=" * 80)
print("TARGET TRANSFORMATION")
print("=" * 80)

# Statistics before transformation
print("\nBefore transformation:")
print(f"  Skewness: {y.skew():.3f}")
print(f"  Kurtosis: {y.kurtosis():.3f}")

# Log transformation (log1p computes log(1 + y))
y_log = np.log1p(y)

print("\nAfter log transformation:")
print(f"  Skewness: {y_log.skew():.3f}")
print(f"  Kurtosis: {y_log.kurtosis():.3f}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Original distribution
axes[0].hist(y, bins=50, edgecolor='black', alpha=0.7)
axes[0].set_title(f'Original Distribution (Skew: {y.skew():.2f})', 
                  fontsize=12, fontweight='bold')
axes[0].set_xlabel('Price (€)')
axes[0].set_ylabel('Frequency')

# Log distribution
axes[1].hist(y_log, bins=50, edgecolor='black', alpha=0.7, color='coral')
axes[1].set_title(f'Log-Transformed Distribution (Skew: {y_log.skew():.2f})', 
                  fontsize=12, fontweight='bold')
axes[1].set_xlabel('Log(Price)')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

print("\n💡 The log transformation reduces skewness and brings the distribution closer to normal")
print("   → Useful for linear models (LinearRegression, Ridge, Lasso)")
print("   → Less critical for tree-based models (RandomForest, XGBoost)")
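
A model trained on the log-transformed target predicts in log space, so its predictions need to be mapped back to prices. Since the target was transformed with `np.log1p`, the matching inverse is `np.expm1`. The short sketch below uses made-up prices and a made-up offset standing in for model predictions.

```python
import numpy as np

# Illustrative prices only
prices = np.array([120_000.0, 250_000.0, 480_000.0])

# Forward transform, as applied to the target in this notebook
y_log_demo = np.log1p(prices)

# Stand-in for model output in log space (hypothetical offset)
pred_log = y_log_demo + 0.01

# Back to the price scale with the inverse of log1p
pred_prices = np.expm1(pred_log)
recovered = np.expm1(y_log_demo)

print(recovered)
print(pred_prices)
```

Error metrics meant to be read in euros are best computed after this inverse step, on `pred_prices` rather than on `pred_log`.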

## 7. Feature Selection

In [None]:
# Correlation analysis for feature selection
print("=" * 80)
print("FEATURE SELECTION - CORRELATION ANALYSIS")
print("=" * 80)

# Absolute correlation of each feature with the target
correlations = X_scaled.corrwith(y).abs().sort_values(ascending=False)

print("\nTop 15 features most correlated with price:")
print(correlations.head(15))

# Visualization
fig, ax = plt.subplots(figsize=(10, 8))
correlations.head(20).plot(kind='barh', ax=ax, edgecolor='black', alpha=0.7)
ax.set_title('Top 20 Features - Correlation with Price', fontsize=12, fontweight='bold')
ax.set_xlabel('Absolute Correlation')
ax.set_ylabel('Features')
plt.tight_layout()
plt.show()

# Identify weakly correlated features (candidates for removal)
low_corr_features = correlations[correlations < 0.05]
print(f"\n⚠️  {len(low_corr_features)} features with correlation < 0.05:")
if len(low_corr_features) > 0:
    print(low_corr_features)
    print("\n💡 Consider removing these features to simplify the model")

In [None]:
# Correlation matrix between features (to detect multicollinearity)
corr_matrix = X_scaled.corr().abs()

# Find highly correlated feature pairs (> 0.9)
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if corr_matrix.iloc[i, j] > 0.9:
            high_corr_pairs.append((
                corr_matrix.columns[i], 
                corr_matrix.columns[j], 
                corr_matrix.iloc[i, j]
            ))

if high_corr_pairs:
    print("\n" + "=" * 80)
    print("⚠️  HIGHLY CORRELATED FEATURES (Multicollinearity)")
    print("=" * 80)
    for feat1, feat2, corr_val in high_corr_pairs:
        print(f"  {feat1} <-> {feat2}: {corr_val:.3f}")
    print("\n💡 Consider dropping one of the two features in each pair")
else:
    print("\n✅ No multicollinearity detected (no pair with correlation > 0.9)")
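
The check above only reports highly correlated pairs. One simple way to act on it, shown here as a sketch rather than as this notebook's method, is to scan the upper triangle of the correlation matrix and drop the later column of each pair above the threshold. `drop_correlated` is a hypothetical helper and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.9):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop

# Toy data (illustrative values only)
demo = pd.DataFrame({
    'surface_m2': [20.0, 35.0, 50.0, 80.0, 120.0],
    'rooms': [1, 3, 2, 2, 4],
})
demo['surface_squared'] = demo['surface_m2'] ** 2

reduced, dropped = drop_correlated(demo)
print(dropped)
print(reduced.columns.tolist())
```

Which member of a pair to keep is a modeling decision; this sketch keeps the column that appears first, which may not always be the more informative one.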

## 8. Train/Test Split

In [None]:
print("=" * 80)
print("TRAIN/TEST SPLIT")
print("=" * 80)

# 80/20 random split (no stratification is applied here)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, 
    test_size=0.2, 
    random_state=42
)

# Also create the log-transformed versions of y
y_train_log = np.log1p(y_train)
y_test_log = np.log1p(y_test)

print("\n✅ Split done:")
print(f"  Train set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X_scaled)*100:.1f}%)")
print(f"  Test set:  {X_test.shape[0]} samples ({X_test.shape[0]/len(X_scaled)*100:.1f}%)")
print(f"  Number of features: {X_train.shape[1]}")

# Statistics
print("\nTrain set statistics:")
print(f"  Mean price: {y_train.mean():,.2f} €")
print(f"  Median price: {y_train.median():,.2f} €")
print(f"  Standard deviation: {y_train.std():,.2f} €")

print("\nTest set statistics:")
print(f"  Mean price: {y_test.mean():,.2f} €")
print(f"  Median price: {y_test.median():,.2f} €")
print(f"  Standard deviation: {y_test.std():,.2f} €")
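
If the split should preserve the share of each city in both sets, `train_test_split` accepts a `stratify` argument. The sketch below assumes a `city` column is still available, as it is in `df`, and uses toy data with made-up city names and prices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (illustrative cities and values only)
demo = pd.DataFrame({
    'city': ['Paris'] * 10 + ['Lyon'] * 10,
    'surface_m2': range(20, 40),
    'price': range(100_000, 120_000, 1_000),
})
X_demo = demo[['surface_m2']]
y_demo = demo['price']

# stratify keeps the city proportions identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,
    stratify=demo['city'],
)

test_cities = demo.loc[X_te.index, 'city'].value_counts()
print(test_cities)
```

Stratification requires at least two rows per city, so very rare cities would need to be grouped or handled separately before this call.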

## 9. Export Processed Data

In [None]:
print("=" * 80)
print("SAVING THE PREPARED DATA")
print("=" * 80)

# Save the datasets
import pickle

# Build a dictionary with all the datasets
data_dict = {
    'X_train': X_train,
    'X_test': X_test,
    'y_train': y_train,
    'y_test': y_test,
    'y_train_log': y_train_log,
    'y_test_log': y_test_log,
    'feature_names': X_train.columns.tolist(),
    'scaler': scaler
}

# Save
with open('../data/processed_data.pkl', 'wb') as f:
    pickle.dump(data_dict, f)

print("\n✅ Data saved: ../data/processed_data.pkl")
print("\nFile contents:")
print("  - X_train, X_test (scaled features)")
print("  - y_train, y_test (original target)")
print("  - y_train_log, y_test_log (log-transformed target)")
print("  - feature_names (list of features)")
print("  - scaler (for production)")

# Also save as CSV for inspection
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)

train_df.to_csv('../data/train_processed.csv', index=False)
test_df.to_csv('../data/test_processed.csv', index=False)

print("\n✅ Also saved as CSV:")
print("  - ../data/train_processed.csv")
print("  - ../data/test_processed.csv")
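
A later notebook can read the saved dictionary back with `pickle.load`. The round trip below is a self-contained sketch that writes a small stand-in bundle to a temporary directory instead of `../data/processed_data.pkl`; the bundle contents are illustrative.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for data_dict (illustrative contents only)
bundle = {
    'X_train': pd.DataFrame({'surface_m2': [0.1, -0.3]}),
    'feature_names': ['surface_m2'],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'processed_data.pkl'

    # Write the bundle, as done in the export cell
    with open(path, 'wb') as f:
        pickle.dump(bundle, f)

    # Read it back, as a later notebook would
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(loaded['feature_names'])
```

Pickle files can execute arbitrary code when loaded, so they should only be read from trusted sources.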

---
## Summary

### Features created:
1. **Numerical transformations:**
   - `price_per_m2`: Price per m² (exploration only; dropped before modeling because it is derived from the target)
   - `surface_per_room`: Surface per room
   - `log_surface`: Log transform of the surface
   - `surface_squared`: Surface squared (polynomial)
   - `room_density`: Density (rooms/surface)

2. **Categorical features:**
   - `age_category`: Age categories (Neuf/Récent/Moyen/Ancien)
   - `surface_category`: Surface categories (Studio/Moyen/Grand/Très_grand)
   - `floor_category`: Floor categories (RDC/Bas/Moyen/Haut)
   - `comfort_score`: Score 0-3 (elevator + parking + balcony)

3. **Interactions:**
   - `city_surface_encoded`: City × surface interaction (target encoded)

### Encodings applied:
- One-Hot Encoding: city, age_category, surface_category, floor_category
- Ordinal Encoding: energy_class (A=7 ... G=1)
- Target Encoding: city_surface_interaction

### Preprocessing:
- Outliers: capped with the IQR method
- Scaling: RobustScaler on numeric features
- Target: log transformation available (optional)

### Next Step:
**Notebook 03 - Model Training**: Train and compare different ML models

---