## Feature Engineering & Preprocessing (Preparing ML-Ready Data)

Feature Engineering is the process of creating, selecting, and transforming variables (features) to improve the performance of machine learning models.

In this notebook, we will apply several techniques to transform the cleaned dataset into a more informative and ML-ready format.

### Objectives of this step:

1. **Remove or encode non-numeric columns**  
   Convert categorical variables to numeric formats using techniques such as One-Hot Encoding or Label Encoding.

2. **Handle date or time-related features**  
   Derive features from `buildingConstructionYear` (e.g., building age) or from other temporal indicators.

3. **Create new derived features**  
   Add variables like:
   - `building_age` = `current_year - buildingConstructionYear`

4. **(Optional) Normalize or scale selected features**  
   Tree-based models (e.g., XGBoost, Random Forest) are insensitive to feature scale, so this step is skipped for now. It may be added later when experimenting with distance-based algorithms such as k-NN.

5. **Reduce dimensionality or drop irrelevant features**  
   Focus only on the most relevant variables based on domain knowledge or correlation analysis.

At the end of this notebook, we will generate a dataset ready for training baseline models.
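
The two encoding options named in objective 1 can be sketched on a toy frame. This is a minimal illustration, not the project's `PreprocessingPipeline`: the values are made up, the column names are borrowed from the dataset, and the `A` to `G` ordering used for `epcScore` is an assumption. The second technique shown is an explicit ordinal mapping, a variant of label encoding that keeps a meaningful order.

```python
import pandas as pd

# Toy frame: column names mirror the dataset, values are illustrative only.
toy = pd.DataFrame({
    "type": ["APARTMENT", "HOUSE", "APARTMENT"],
    "epcScore": ["C", "F", "A"],
})

# One-hot encoding: one 0/1 column per category, suited to unordered variables.
one_hot = pd.get_dummies(toy["type"], prefix="type", dtype=int)

# Ordinal mapping: replace an ordered category with its rank.
# Assumed order (hypothetical): A is best, G is worst.
epc_order = ["A", "B", "C", "D", "E", "F", "G"]
epc_rank = {label: rank for rank, label in enumerate(epc_order)}
toy["epc_rank"] = toy["epcScore"].map(epc_rank)

encoded = pd.concat([toy[["epc_rank"]], one_hot], axis=1)
print(encoded)
```

One-hot encoding avoids implying an order between categories but adds one column per distinct value, while the ordinal mapping keeps a single column and only makes sense when the order is real.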


In [1]:
import os
import sys

import pandas as pd

# Add the project root to the Python path so the local `utils` package is importable
project_root = os.path.abspath("../..")
if project_root not in sys.path:
    sys.path.append(project_root)

# Imports from local modules
from utils.data_loader import DataLoader
from utils.experiment_tracker import ExperimentTracker
from utils.constants import CLEANED_DIR

# Locate and load the most recent cleaned dataset
tracker = ExperimentTracker()
last_cleaned_dataset = tracker.get_latest_cleaned_file(CLEANED_DIR)
loader = DataLoader(last_cleaned_dataset)
df = loader.load_data()

# Replace boolean values with 0 and 1
df = loader.clean_booleans(df, bool_cols=["hasLivingRoom", "hasTerrace"])

# Drop unnecessary columns and rows with a missing target
df = loader.drop_columns(df, columns_to_drop=["Unnamed: 0", "id", "url"])
df = loader.drop_na_targets(df, target_col="price")

df.head()


| | type | subtype | bedroomCount | bathroomCount | province | locality | postCode | habitableSurface | buildingCondition | buildingConstructionYear | ... | kitchenType | hasLivingRoom | toiletCount | hasTerrace | epcScore | price | log_price | is_big_property | room_count | surface_per_room |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | APARTMENT | APARTMENT | 2.0 | 1.0 | Brussels | Etterbeek | 1040 | 100.0 | GOOD | 2004.0 | ... | SEMI_EQUIPPED | 1 | 1.0 | 1 | C | 399000.0 | 12.896719 | 0 | 3.0 | 33.333333 |
| 1 | APARTMENT | APARTMENT | 2.0 | 1.0 | Brussels | Etterbeek | 1040 | 87.0 | AS_NEW | 1970.0 | ... | HYPER_EQUIPPED | 1 | 1.0 | 1 | F | 465000.0 | 13.049795 | 0 | 3.0 | 29.0 |
| 2 | APARTMENT | FLAT_STUDIO | 1.0 | 1.0 | Brussels | Etterbeek | 1040 | 71.0 | AS_NEW | 1906.0 | ... | INSTALLED | 0 | 1.0 | 0 | E | 289000.0 | 12.574185 | 0 | 2.0 | 35.5 |
| 3 | APARTMENT | APARTMENT | 2.0 | 1.0 | Brussels | ETTERBEEK | 1040 | 90.0 | TO_BE_DONE_UP | 1958.0 | ... | INSTALLED | 1 | 1.0 | 1 | D | 375000.0 | 12.834684 | 0 | 3.0 | 30.0 |
| 4 | APARTMENT | APARTMENT | 1.0 | 1.0 | Brussels | Etterbeek | 1040 | 93.0 | TO_BE_DONE_UP | 1947.0 | ... | | 1 | 1.0 | 1 | F | 297000.0 | 12.601491 | 0 | 2.0 | 46.5 |
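
The preview above shows the same locality spelled both `Etterbeek` and `ETTERBEEK`. If such variants reach a one-hot encoder untouched, each spelling becomes its own column. Whether `PreprocessingPipeline` already normalizes casing is not shown in this notebook, so the following is only a sketch of one possible guard. The helper name `normalize_text` and the toy values are hypothetical.

```python
import pandas as pd

# Illustrative values only; the first two spellings are the ones seen in the preview.
toy = pd.DataFrame({"locality": ["Etterbeek", "ETTERBEEK", " etterbeek ", "Ixelles"]})


def normalize_text(series: pd.Series) -> pd.Series:
    """Trim whitespace and unify casing so equal labels compare equal."""
    return series.str.strip().str.title()


before = toy["locality"].nunique()
toy["locality"] = normalize_text(toy["locality"])
after = toy["locality"].nunique()

print(f"Distinct localities before: {before}, after: {after}")
```

Missing values pass through the `.str` accessor unchanged, so this step does not interfere with later imputation.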


In [2]:
# Create derived features before running the pipeline.
# 2025 is the hard-coded reference year (`current_year` in the objectives above).
df["building_age"] = 2025 - df["buildingConstructionYear"]
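
The hard-coded year in the cell above keeps results reproducible but has to be updated by hand. As a hedged alternative, the derivation could be wrapped in a small helper that accepts the reference year as a parameter. The function `add_building_age`, its `reference_year` argument, and the choice to clip negative ages to zero are assumptions made for illustration and are not part of the project code.

```python
from datetime import date
from typing import Optional

import pandas as pd


def add_building_age(frame: pd.DataFrame, reference_year: Optional[int] = None) -> pd.DataFrame:
    """Return a copy with building_age = reference_year - buildingConstructionYear."""
    year = date.today().year if reference_year is None else reference_year
    out = frame.copy()
    age = year - out["buildingConstructionYear"]
    # A construction year in the future would give a negative age; clip it to 0 (assumed policy).
    out["building_age"] = age.clip(lower=0)
    return out


# Toy input: two known years, one missing year, one future year.
toy = pd.DataFrame({"buildingConstructionYear": [2004.0, 1970.0, None, 2027.0]})
result = add_building_age(toy, reference_year=2025)
print(result)
```

Passing `reference_year` explicitly keeps a run reproducible, while omitting it falls back to the current calendar year. Missing construction years stay missing, so they can be handled by whatever imputation the pipeline applies.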

In [3]:
from utils.preprocessing_pipeline import PreprocessingPipeline

# Initialize the preprocessing pipeline.
# price_per_m2 and log_price are derived from the target; keeping them would leak it into the features.
pipeline = PreprocessingPipeline(
    df=df,
    target_col="price",
    drop_cols=["price_per_m2", "log_price"],
)

df_encoded = pipeline.fit_transform()

# Check whether the unwanted columns are still present
for col in ["price_per_m2", "log_price"]:
    if col in df_encoded.columns:
        print(f"❌ Unwanted column still present: {col}")
    else:
        print(f"[INFO] Column removed: {col}")


# Save the full dataset and a 10-row sample

import shutil
from utils.constants import ML_READY_DIR, ML_READY_DATA_FILE, ML_READY_SAMPLE_XLSX

# Delete any existing ml_ready directory (and everything in it), then recreate it
if os.path.exists(ML_READY_DIR):
    shutil.rmtree(ML_READY_DIR)
os.makedirs(ML_READY_DIR, exist_ok=True)

# Save to CSV and Excel
df_encoded.to_csv(ML_READY_DATA_FILE, index=False)
df_encoded.head(10).to_excel(ML_READY_SAMPLE_XLSX, index=False)

print(f"Dataset ready. Shape: {df_encoded.shape}")
print(f"Saved to: {ML_READY_DATA_FILE}")
print(f"Excel sample: {ML_READY_SAMPLE_XLSX}")

[INFO] Column removed: price_per_m2
[INFO] Column removed: log_price
Dataset ready. Shape: (19013, 2837)
Saved to: e:\_SoftEng\_BeCode\real-estate-price-predictor\data\ml_ready\immoweb_real_estate_ml_ready.csv
Excel sample: e:\_SoftEng\_BeCode\real-estate-price-predictor\data\ml_ready\immoweb_real_estate_ml_ready_sample10.xlsx
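
The reported shape has 2837 columns, far more than the input frame. One plausible cause, which this notebook does not confirm, is one-hot encoding of a high-cardinality column such as `locality` or `postCode`. If that is the case, grouping rare categories before encoding is one way to pursue objective 5. The sketch below uses a hypothetical helper `group_rare`, an arbitrary threshold, and toy values.

```python
import pandas as pd


def group_rare(series: pd.Series, min_count: int, other_label: str = "OTHER") -> pd.Series:
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; overwrite the rare ones.
    return series.where(~series.isin(rare), other_label)


# Toy data: two frequent localities and two that appear once.
toy = pd.Series(["Etterbeek"] * 3 + ["Ixelles"] * 2 + ["Uccle", "Jette"], name="locality")
grouped = group_rare(toy, min_count=2)

wide = pd.get_dummies(toy).shape[1]
narrow = pd.get_dummies(grouped).shape[1]
print(f"One-hot columns before grouping: {wide}, after: {narrow}")
```

The threshold trades detail for width: a higher `min_count` yields fewer columns but merges more localities into a single bucket, so it is worth validating against model performance rather than fixing in advance.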
