# Feature Engineering for Large-Scale Modeling

This notebook performs feature engineering on the cleaned Dubai Land
Transactions dataset using Apache Spark.

The objective of this notebook is to transform raw transactional and
categorical attributes into a numerical, model-ready feature vector
while preserving scalability and reproducibility.

No machine learning models are trained in this notebook. The output is
a feature-engineered dataset that serves as the sole input for all
subsequent modeling, cross-validation, and ensemble learning steps.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DLD_Feature_Engineering") \
    .getOrCreate()

In [None]:
df = spark.read.parquet(
    "/content/cleaned_parquet"
)

## Feature Selection

Only attributes relevant to transaction value, property characteristics,
spatial context, and transaction structure are retained. Identifier
fields and free-text attributes that do not contribute to modeling
are excluded.

In [9]:
selected_cols = [
    "meter_sale_price",          # target
    "procedure_area",
    "has_parking",
    "no_of_parties_role_1",
    "no_of_parties_role_2",
    "no_of_parties_role_3",
    "property_type_en",
    "property_sub_type_en",
    "property_usage_en",
    "area_name_en",
    "nearest_metro_en",
    "nearest_mall_en",
    "instance_date"
]

df = df.select(*selected_cols)
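Selecting a column that is absent from the Parquet schema makes `select` fail at analysis time, so it can help to check the list first. The helper below is a minimal plain-Python sketch of such a check. The name `missing_columns` and the example schema are hypothetical and not part of the original notebook; in practice the first argument would be `df.columns`.

```python
def missing_columns(available, required):
    """Return the required column names that are absent from `available`, in order."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


# Hypothetical schema for illustration only; the notebook would pass df.columns.
example_columns = ["meter_sale_price", "procedure_area", "instance_date"]
example_required = ["meter_sale_price", "procedure_area", "has_parking"]

print(missing_columns(example_columns, example_required))
```

An empty result means every requested column is present and the selection can proceed.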

## Handling Missing Values

Missing values in the party-count columns and the parking flag are
replaced with 0, which treats a missing entry as absence. This is
required because Spark ML's VectorAssembler raises an error on null
inputs by default.

In [10]:
df = df.fillna({
    "no_of_parties_role_1": 0,
    "no_of_parties_role_2": 0,
    "no_of_parties_role_3": 0,
    "has_parking": 0
})

## Temporal Feature Extraction

Temporal attributes are derived from the transaction registration date
to capture seasonal and longitudinal effects in transaction behavior.


In [17]:
from pyspark.sql.functions import to_date, year, month, quarter, col

# Explicitly parse the date string using the correct format
df = df.withColumn(
    "instance_date_parsed",
    to_date(col("instance_date"), "dd-MM-yyyy")
)

# Extract temporal features from the parsed date
df = df.withColumn("transaction_year", year("instance_date_parsed")) \
       .withColumn("transaction_month", month("instance_date_parsed")) \
       .withColumn("transaction_quarter", quarter("instance_date_parsed"))
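To make the derived columns concrete, the following plain-Python sketch reproduces the same three values for a single date string. It is an illustration only, not the Spark implementation: the function name `temporal_features` and the sample dates are made up, and `%d-%m-%Y` is the `strptime` counterpart of the `dd-MM-yyyy` pattern used above.

```python
from datetime import datetime


def temporal_features(date_string):
    """Parse a dd-MM-yyyy string and return (year, month, quarter)."""
    parsed = datetime.strptime(date_string, "%d-%m-%Y")
    # Months 1-3 fall in quarter 1, 4-6 in quarter 2, and so on.
    quarter = (parsed.month - 1) // 3 + 1
    return parsed.year, parsed.month, quarter


print(temporal_features("15-08-2021"))  # (2021, 8, 3)
```

Unlike this sketch, which raises `ValueError` on a malformed string, Spark's handling of unparseable dates depends on the Spark version and its parser settings, so it is worth counting null values in `instance_date_parsed` before relying on the derived columns.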

## Encoding Categorical Features

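Each categorical attribute is first mapped to an integer index with
StringIndexer and then expanded into a sparse one-hot vector with
OneHotEncoder. Setting handleInvalid="keep" assigns null and unseen
labels to an extra index instead of raising an error.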

In [12]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

categorical_cols = [
    "property_type_en",
    "property_sub_type_en",
    "property_usage_en",
    "area_name_en",
    "nearest_metro_en",
    "nearest_mall_en"
]

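# The loop variable is named `c` so it does not shadow
# pyspark.sql.functions.col, which is imported elsewhere in this notebook.
indexers = [
    StringIndexer(
        inputCol=c,
        outputCol=f"{c}_idx",
        handleInvalid="keep"
    )
    for c in categorical_cols
]

encoders = [
    OneHotEncoder(
        inputCol=f"{c}_idx",
        outputCol=f"{c}_ohe"
    )
    for c in categorical_cols
]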
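The two-step encoding can be illustrated on a toy column in plain Python. This is a sketch under stated assumptions rather than Spark's implementation: it assumes StringIndexer's default frequency-descending order with alphabetical tie-breaking, an extra bucket for unseen labels as produced by `handleInvalid="keep"`, and OneHotEncoder's default of dropping the last category. The function names and sample labels are hypothetical.

```python
from collections import Counter


def fit_index(values):
    """Map each label to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(label, index, drop_last=True):
    """Encode a label as a dense list; unseen labels use an extra bucket at len(index)."""
    size = len(index) + 1                     # seen labels plus the unseen bucket
    position = index.get(label, len(index))   # unseen labels map to the last index
    width = size - 1 if drop_last else size
    vector = [0.0] * width
    if position < width:                      # a dropped category encodes as all zeros
        vector[position] = 1.0
    return vector


sample = ["Unit", "Unit", "Villa", "Land", "Unit", "Villa"]
index = fit_index(sample)

print(index)                       # {'Unit': 0, 'Villa': 1, 'Land': 2}
print(one_hot("Villa", index))     # [0.0, 1.0, 0.0]
print(one_hot("Building", index))  # [0.0, 0.0, 0.0]
```

Under these assumptions the dropped category is the unseen bucket, so a label that was not present at fit time encodes as an all-zero vector.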

## Feature Vector Assembly

Numerical and encoded categorical features are combined into a single
feature vector required by Spark ML models.

In [13]:
from pyspark.ml.feature import VectorAssembler

numeric_cols = [
    "procedure_area",
    "has_parking",
    "no_of_parties_role_1",
    "no_of_parties_role_2",
    "no_of_parties_role_3",
    "transaction_year",
    "transaction_month",
    "transaction_quarter"
]

assembler_inputs = numeric_cols + [f"{c}_ohe" for c in categorical_cols]

assembler = VectorAssembler(
    inputCols=assembler_inputs,
    outputCol="features"
)

## Feature Scaling Using StandardScaler

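To prevent numerical features with large magnitudes from dominating the
learning process, the assembled vector is standardized with Spark ML's
StandardScaler. Setting withMean=True centers every feature, which
converts the sparse one-hot columns into a dense vector and increases
memory use.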

In [14]:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(
    inputCol="features",
    outputCol="scaled_features",
    withMean=True,
    withStd=True
)
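The arithmetic behind standardization can be shown on a single column in plain Python. This is a minimal sketch, not Spark's code: it centers on the mean and divides by the sample standard deviation (the n - 1 form, which StandardScaler also uses). The function name `standardize`, the sample areas, and the choice to map a zero-variance column to 0.0 are assumptions of this example.

```python
from statistics import mean, stdev


def standardize(column):
    """Center a list of numbers on its mean and divide by its sample standard deviation."""
    mu = mean(column)
    sigma = stdev(column)  # sample standard deviation (divides by n - 1)
    if sigma == 0:
        # Assumption of this sketch: a constant column carries no signal.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


areas = [50.0, 100.0, 150.0]  # made-up procedure_area values
print(standardize(areas))     # [-1.0, 0.0, 1.0]
```

After scaling, each feature has mean 0 and unit sample standard deviation, so a large-valued column such as `procedure_area` no longer outweighs the 0/1 indicator columns.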

In [18]:
from pyspark.ml import Pipeline

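# Stage order matters: index each categorical column, one-hot encode the
# indices, assemble all inputs into one vector, then standardize it.
pipeline = Pipeline(stages=indexers + encoders + [assembler, scaler])
feature_model = pipeline.fit(df)            # learns label indices and scaling statistics
df_features = feature_model.transform(df)   # appends the engineered columns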

## Feature-Engineered Dataset Verification

A final integrity check is performed to confirm that the feature vector
has been successfully constructed and is ready for modeling.


In [19]:
print("Rows:", df_features.count())
print("Columns:", len(df_features.columns))
df_features.select("scaled_features", "meter_sale_price").show(5, truncate=False)

Rows: 30173
Columns: 31

In [20]:
df_features.write \
    .mode("overwrite") \
    .parquet("land_transactions_features.parquet")