# Baseline Machine Learning Models

This notebook establishes baseline machine learning performance on the
feature-engineered Dubai Land Transactions dataset using Spark ML.

The purpose of this notebook is to:
- Train standard baseline models using a simple train/test split
- Obtain reference performance metrics
- Provide comparison points for manual cross-validation and ensemble models

All feature engineering, encoding, and scaling steps are assumed to have
been completed prior to this notebook.

In [6]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DLD_Baseline_Models") \
    .getOrCreate()

In [7]:
df_ml = spark.read.parquet(
    "land_transactions_features.parquet"
)

## Dataset Integrity Verification Before Modeling

Before applying baseline models, a final integrity check confirms the row
count, the column count, and the presence of the `scaled_features` and
`meter_sale_price` columns used for modeling.

In [5]:
print("Rows:", df_ml.count())
print("Columns:", len(df_ml.columns))
df_ml.select("scaled_features", "meter_sale_price").show(5, truncate=False)

Rows: 30173
Columns: 31

## Train/Test Split

A simple train/test split is used to establish baseline model performance.
This split is used only for initial benchmarking. Robust evaluation is
performed later using manual 10-fold cross-validation.

In [8]:
train_df, test_df = df_ml.randomSplit([0.8, 0.2], seed=42)
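
`randomSplit` assigns rows to splits by random sampling, not by exact counts, so the resulting sizes are only approximately 80/20. The plain-Python sketch below illustrates that behavior with an independent random draw per row. It is not Spark's implementation, and `bernoulli_split` is a hypothetical helper introduced only for illustration.

```python
import random


def bernoulli_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_fraction,
        # so the final sizes vary around the requested proportions.
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(30173))  # same row count as the dataset above
train_rows, test_rows = bernoulli_split(rows)
train_share = len(train_rows) / len(rows)
print(f"train share: {train_share:.4f}")
```

In the notebook itself, `train_df.count()` and `test_df.count()` would report the realized split sizes.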

## Baseline Model 1: Linear Regression

Linear Regression is used as a simple and interpretable baseline to model
the relationship between property attributes and the target,
`meter_sale_price`.

In [9]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(
    featuresCol="scaled_features",
    labelCol="meter_sale_price"
)

lr_model = lr.fit(train_df)
lr_predictions = lr_model.transform(test_df)

## Evaluation Metrics: Linear Regression

In [10]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    labelCol="meter_sale_price",
    predictionCol="prediction",
    metricName="rmse"
)

rmse_lr = evaluator.evaluate(lr_predictions)
print("Linear Regression RMSE:", rmse_lr)

Linear Regression RMSE: 1286960.9214669454
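
RMSE is expressed in the units of the label and weights large errors heavily, so a second metric can help put it in context. `RegressionEvaluator` also accepts `"mae"` and `"r2"` as `metricName`. As a rough sketch of what these three metrics measure, the plain-Python functions below compute them on a small made-up sample. The numbers are illustrative only and are not taken from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up labels and predictions, for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

print("RMSE:", rmse(y_true, y_pred))
print("MAE:", mae(y_true, y_pred))
print("R2:", r2(y_true, y_pred))
```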


## Baseline Model 2: Decision Tree Regressor

A Decision Tree Regressor is used to capture non-linear relationships
between features and the target, `meter_sale_price`. The tree depth is
capped at 5 to keep the baseline simple.

In [11]:
from pyspark.ml.regression import DecisionTreeRegressor

dt = DecisionTreeRegressor(
    featuresCol="scaled_features",
    labelCol="meter_sale_price",
    maxDepth=5,
    seed=42
)

dt_model = dt.fit(train_df)
dt_predictions = dt_model.transform(test_df)

rmse_dt = evaluator.evaluate(dt_predictions)
print("Decision Tree RMSE:", rmse_dt)

Decision Tree RMSE: 116741.23119833933
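
A regression tree chooses each split to reduce the variance of the label within the resulting child nodes; variance is the impurity measure Spark uses for regression trees. The sketch below searches a single numeric feature for the threshold that minimizes the weighted child variance. It is a simplified illustration on made-up data: Spark bins continuous features (see `maxBins`) instead of testing every midpoint, and `best_split` is a hypothetical helper.

```python
def variance(values):
    """Population variance of a non-empty list."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, weighted_child_variance) for the best split on xs.

    Returns None when all feature values are identical (no valid split).
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = None
    for i in range(1, n):
        left_x, right_x = pairs[i - 1][0], pairs[i][0]
        if left_x == right_x:
            continue  # cannot split between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Weight each child's variance by its share of the rows.
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / n
        if best is None or weighted < best[1]:
            best = ((left_x + right_x) / 2, weighted)
    return best


# Made-up feature and label values with an obvious break between 3 and 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 5.0, 50.0, 52.0, 51.0]
threshold, impurity = best_split(xs, ys)
print("threshold:", threshold, "weighted variance:", round(impurity, 4))
```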


## Baseline Unsupervised Model: K-Means Clustering

In addition to supervised prediction, clustering is applied to identify
latent structure in transaction behavior that may not be captured by
regression models. K-Means is fit on the full dataset with k=5 and
evaluated with the silhouette score.

In [12]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans(
    k=5,
    seed=42,
    featuresCol="scaled_features"
)

kmeans_model = kmeans.fit(df_ml)
clustered_df = kmeans_model.transform(df_ml)

clustering_evaluator = ClusteringEvaluator(
    featuresCol="scaled_features"
)

silhouette = clustering_evaluator.evaluate(clustered_df)
print("Silhouette Score:", silhouette)

Silhouette Score: 0.4366926210427686
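
The silhouette score ranges from -1 to 1, and values near 1 indicate tight, well-separated clusters. `ClusteringEvaluator` uses squared Euclidean distance by default. As a sketch of the standard definition under that same distance, the plain-Python function below computes the mean silhouette for a handful of made-up points. It is not how Spark computes the score internally, and `mean_silhouette` is a hypothetical helper.

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_silhouette(points, labels):
    """Mean silhouette over all points, using squared Euclidean distance."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Collect distances from point i to every other point, by cluster.
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(squared_euclidean(p, q))
        # Singleton cluster, or no other cluster to compare with: score 0.
        if label not in by_cluster or len(by_cluster) < 2:
            scores.append(0.0)
            continue
        # a: mean distance to the point's own cluster.
        a = sum(by_cluster[label]) / len(by_cluster[label])
        # b: mean distance to the nearest other cluster.
        b = min(sum(d) / len(d) for c, d in by_cluster.items() if c != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Two made-up, well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]
score = mean_silhouette(points, labels)
print("mean silhouette:", round(score, 4))
```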


## Baseline Model Summary

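On the simple train/test split, Linear Regression reached an RMSE of about
1.29 million and the depth-5 Decision Tree about 117 thousand, both in the
units of `meter_sale_price`. K-Means with k=5 gave a silhouette score of
about 0.44. These results are not used for final model selection, but
instead serve as comparison points for: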

- Manual 10-fold cross-validation
- Custom ensemble learning models

Subsequent notebooks focus on improving robustness and generalization
through more rigorous evaluation strategies.

In [14]:
import pandas as pd
from pathlib import Path

Path("data/outputs/metrics").mkdir(parents=True, exist_ok=True)

baseline_df = pd.DataFrame({
    "model": ["LinearRegression", "DecisionTree"],
    "rmse": [rmse_lr, rmse_dt]
})

baseline_df.to_csv(
    "data/outputs/metrics/baseline_metrics.csv",
    index=False
)