Load the cleaned data into a DataFrame

Numerical transformations

- Scaling (standardization, normalization)
- Log transforms (reduce skewness)
- Binning
- Polynomial features

Categorical transformations

- One-hot encoding
- Dummy encoding
- Label encoding
- Target/frequency encoding

Other steps

- Train-test split
- Handling imbalance
- Feature selection
- Feature engineering

Goal:
➡️ Convert cleaned data into a form where models can learn patterns effectively.

In [12]:
import pandas as pd
df_preprocessed = pd.read_csv('../data/2_cleaned/cleaned_data.csv')

In [13]:

numeric_features = df_preprocessed.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = df_preprocessed.select_dtypes(include=['object']).columns.tolist()

print("Numeric features:", numeric_features)
print("Categorical features:", categorical_features)

Numeric features: ['postal_code', 'price', 'number_of_rooms', 'living_area', 'equipped_kitchen', 'furnished', 'open_fire', 'terrace', 'garden', 'number_of_facades', 'swimming_pool', 'garden_surface', 'terrace_surface']
Categorical features: ['property_id', 'locality_name', 'type_of_property', 'subtype_of_property', 'state_of_building']


Numeric transformations
- Scaling (standardization, normalization)
- Log transforms (reduce skewness)
- Binning
- Polynomial features

Scaling pipelines:

StandardScaler and MinMaxScaler are two common preprocessing techniques in machine learning:

StandardScaler → transforms features to mean = 0 and standard deviation = 1

MinMaxScaler → scales features to a fixed range (default 0–1)

By putting them in a Pipeline, you create a reusable transformation that can be easily fit and applied to data.

In [14]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline

# Standardization
standard_scaler = Pipeline([
    ("scaler", StandardScaler())
])

# Normalization
minmax_scaler = Pipeline([
    ("scaler", MinMaxScaler())
])


Pipelines keep preprocessing and modeling consistent

When you train a model, you should apply exactly the same scaling to:

- training data
- validation/test data
- future inference data

Pipelines make this automatic:
fit learns the scaling parameters from the training data → transform applies them to any other data.

Pipelines:

- ensure consistent preprocessing
- help prevent data leakage (scaling statistics are learned from the training data only)
- integrate nicely with ML models
- make experimentation easy
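
The fit/transform split described above can be sketched as follows. This is a minimal illustration on synthetic numbers, not on the notebook's dataset: the array `X` and the names `scaler_pipe`, `X_train_scaled` and `X_test_scaled` are assumptions made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one numeric column (hypothetical values)
rng = np.random.default_rng(0)
X = rng.normal(loc=120.0, scale=40.0, size=(200, 1))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler_pipe = Pipeline([("scaler", StandardScaler())])

# fit_transform learns mean and standard deviation from the training rows only
X_train_scaled = scaler_pipe.fit_transform(X_train)

# transform reuses those training statistics on unseen rows
X_test_scaled = scaler_pipe.transform(X_test)

learned_mean = scaler_pipe.named_steps["scaler"].mean_[0]
print("Mean learned from training data:", learned_mean)
print("Training mean after scaling:", X_train_scaled.mean())
print("Test mean after scaling:", X_test_scaled.mean())
```

The test mean is usually close to zero but not exactly zero, which is expected: the test rows are scaled with the training statistics, never with their own.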

Log Transforms

In [15]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = Pipeline([
    ("log", FunctionTransformer(np.log1p, validate=False)),
    ("scaler", StandardScaler())
])
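
To see what the log step does to a skewed column, the sketch below applies the same two-step pipeline to synthetic, right-skewed values. The log-normal `prices` series is an assumption standing in for a column such as `price`; it is not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic right-skewed values (hypothetical stand-in for a price column)
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=12.5, sigma=0.6, size=5000), name="price")

log_transformer = Pipeline([
    ("log", FunctionTransformer(np.log1p)),  # log(1 + x), defined for x > -1
    ("scaler", StandardScaler()),
])

transformed = log_transformer.fit_transform(prices.to_frame())

skew_before = prices.skew()
skew_after = pd.Series(transformed.ravel()).skew()
print("Skewness before:", skew_before)
print("Skewness after log1p + scaling:", skew_after)
```

Scaling does not change skewness, so the drop comes entirely from `np.log1p`. Note that `log1p` is undefined for values at or below -1, so it suits non-negative columns.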


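PowerTransformer: estimates a suitable power transform for each feature, so skew is handled automatically (the default Yeo-Johnson method also accepts zero and negative values, unlike a plain log)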

In [None]:
from sklearn.preprocessing import PowerTransformer

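power_transformer = Pipeline([
    # Defaults: method="yeo-johnson", standardize=True.
    # The output is already standardized, so the scaler below is redundant but harmless.
    ("power", PowerTransformer()),
    ("scaler", StandardScaler())
])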
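
A small sketch of what PowerTransformer learns, run on synthetic skewed data (the array `X` is an assumption for illustration). With its default settings the transformer fits one exponent per feature, exposed as `lambdas_`, and returns output with zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature (hypothetical values)
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

pt = PowerTransformer()  # defaults: method="yeo-johnson", standardize=True
X_t = pt.fit_transform(X)

print("Fitted lambda per feature:", pt.lambdas_)
print("Mean after transform:", X_t.mean())
print("Std after transform:", X_t.std())
```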

Binning (Discretization)
Useful for turning numeric values into ordered categories.

Example: bin living_area into size categories

In [17]:
from sklearn.preprocessing import KBinsDiscretizer

binning_pipe = Pipeline([
    ("binning", KBinsDiscretizer(
        n_bins=5,
        encode="ordinal",
        strategy="quantile"
    ))
])
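
The following sketch shows the same discretizer settings applied to a handful of made-up `living_area` values (the numbers are assumptions, not rows from the dataset). With `strategy="quantile"` each bin receives roughly the same number of rows, and `encode="ordinal"` returns the bin index as a number from 0 to `n_bins - 1`.

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical living areas in square metres, sorted for readability
living_area = pd.DataFrame(
    {"living_area": [35, 50, 62, 75, 80, 95, 110, 130, 160, 240]}
)

binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
bins = binner.fit_transform(living_area)

# bin_edges_ holds the learned cut points for each input column
print("Bin edges:", binner.bin_edges_[0])
print("Bin index per row:", bins.ravel())
```

The exact edges depend on how the installed scikit-learn version computes quantiles, so it is worth printing `bin_edges_` rather than assuming particular cut points.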


Polynomial Features (interaction terms, squared terms)
Useful for linear models when relationships are non-linear.

Best practices:

- Use degree 2 in most cases (with degree 3+ the number of generated features grows very quickly)

Works best for:

- Linear Regression
- Ridge / Lasso
- SVR

Avoid polynomial features with tree-based models (RandomForest, XGBoost, etc.): trees learn nonlinear shapes and interactions on their own.

In [18]:
from sklearn.preprocessing import PolynomialFeatures

poly_transformer = Pipeline([
    ("poly", PolynomialFeatures(
        degree=2,
        include_bias=False,
        interaction_only=False
    )),
    ("scaler", StandardScaler())
])

Combine all numeric transformations:

imputation → log transform → polynomial features → scaling:

In [19]:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures, StandardScaler

numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p, validate=False)),  # reduce skew
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler())
])
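
As a quick check of the combined pipeline, the sketch below runs the same four steps on a tiny frame with missing values. The DataFrame `toy` and its numbers are assumptions for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures, StandardScaler

numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),               # fill missing values
    ("log", FunctionTransformer(np.log1p)),                      # reduce skew
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # squares and interactions
    ("scale", StandardScaler()),                                 # zero mean, unit variance
])

# Hypothetical rows with gaps in both columns
toy = pd.DataFrame({
    "living_area": [50.0, 80.0, np.nan, 120.0, 200.0],
    "garden_surface": [0.0, 100.0, 250.0, np.nan, 1000.0],
})

X_num = numeric_pipeline.fit_transform(toy)

print("Output shape:", X_num.shape)
print("Any missing values left:", bool(np.isnan(X_num).any()))
print("Column means:", X_num.mean(axis=0).round(6))
```

Two input columns become five after the degree-2 expansion, and placing the scaler last means every generated column ends up on the same scale.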
