# Manual 10-Fold Cross-Validation

This notebook implements a manual 10-fold cross-validation procedure
for evaluating machine learning models on the Dubai Land Transactions
dataset using Apache Spark.

Library-provided cross-validation utilities are intentionally not used.
All data splitting, model training, and evaluation steps are implemented
explicitly to ensure transparency and methodological clarity.

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DLD_Manual_10Fold_CV") \
    .getOrCreate()

In [3]:
df_ml = spark.read.parquet(
    "land_transactions_features.parquet"
)

## Dataset Integrity Verification

A final integrity check is performed to ensure the feature-engineered
dataset is correctly loaded before cross-validation.

In [5]:
print("Rows:", df_ml.count())
print("Columns:", len(df_ml.columns))

Rows: 30173
Columns: 31


## Manual Fold Assignment

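Each record is assigned to one of ten folds by drawing a seeded uniform
random number and taking `floor(rand * 10)`. The fixed seed makes the
assignment reproducible as long as the input partitioning is unchanged.
The resulting `fold_id` column explicitly controls the training and
validation splits during cross-validation.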

In [6]:
from pyspark.sql.functions import rand, floor

df_folds = df_ml.withColumn(
    "fold_id",
    floor(rand(seed=42) * 10)
)
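
As a rough plain-Python analogue of the cell above (not Spark's own generator, so the actual fold ids will differ), the sketch below shows why `floor(u * 10)` maps a uniform draw `u` in [0, 1) to an integer fold id from 0 to 9, and why a fixed seed reproduces the same assignment. The helper name `assign_folds` is hypothetical and does not appear in the notebook.

```python
import math
import random

def assign_folds(n_rows, k=10, seed=42):
    """Hypothetical helper: one fold id per row from a seeded uniform draw."""
    rng = random.Random(seed)  # stand-in for Spark's rand(seed=42)
    # rng.random() lies in [0, 1), so floor(u * k) lies in {0, ..., k - 1}
    return [math.floor(rng.random() * k) for _ in range(n_rows)]

folds = assign_folds(1000)
print(sorted(set(folds)))            # distinct fold ids that occur
print(folds == assign_folds(1000))   # same seed gives the same assignment
```

Because the draw is independent per row, fold sizes are only approximately equal, which is why the distribution is checked next.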

## Fold Distribution Check

The distribution of records across folds is examined to ensure that
each fold contains a comparable number of observations. With 30,173
rows and ten folds, about 3,017 rows per fold are expected.

In [7]:
df_folds.groupBy("fold_id").count().orderBy("fold_id").show()

+-------+-----+
|fold_id|count|
+-------+-----+
|      0| 3077|
|      1| 3081|
|      2| 2996|
|      3| 2977|
|      4| 3070|
|      5| 3060|
|      6| 2968|
|      7| 3014|
|      8| 2952|
|      9| 2978|
+-------+-----+
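
One way to make "comparable" concrete is a chi-square goodness-of-fit check against a uniform split. The sketch below is an addition, not part of the original notebook; it uses the counts printed above and assumes a 5% significance level with the standard table value for 9 degrees of freedom.

```python
# Fold counts copied from the output above
counts = [3077, 3081, 2996, 2977, 3070, 3060, 2968, 3014, 2952, 2978]

n_rows = sum(counts)               # total rows across all folds
expected = n_rows / len(counts)    # rows per fold under a uniform assignment

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((c - expected) ** 2 / expected for c in counts)

# Assumption: 5% significance level; the critical value for 9 degrees
# of freedom is about 16.92 (standard chi-square table).
CRITICAL_9DF_5PCT = 16.92

print(f"expected per fold = {expected:.1f}")
print(f"chi-square = {chi2:.2f}")
print("consistent with uniform folds:", chi2 < CRITICAL_9DF_5PCT)
```

A statistic well below the critical value indicates that the observed imbalance is within what random assignment would normally produce.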



In [8]:
from pyspark.ml.evaluation import RegressionEvaluator

# Shared evaluator: RMSE between the label column and the model's predictions
evaluator = RegressionEvaluator(
    labelCol="meter_sale_price",
    predictionCol="prediction",
    metricName="rmse"
)

## Manual Cross-Validation Loop

For each fold, a model is trained on nine folds and evaluated on the
remaining fold. Performance metrics are recorded explicitly for each
iteration.

In [9]:
from pyspark.ml.regression import LinearRegression

rmse_scores = []

for fold in range(10):
    print(f"Processing fold {fold}")

    # Train on the nine other folds; validate on the held-out fold
    train_df = df_folds.filter(df_folds.fold_id != fold)
    val_df = df_folds.filter(df_folds.fold_id == fold)

    lr = LinearRegression(
        featuresCol="scaled_features",
        labelCol="meter_sale_price"
    )

    model = lr.fit(train_df)
    predictions = model.transform(val_df)

    rmse = evaluator.evaluate(predictions)
    rmse_scores.append(rmse)

    print(f"Fold {fold} RMSE: {rmse}")

Processing fold 0
Fold 0 RMSE: 275504.41280093306
Processing fold 1
Fold 1 RMSE: 213642.25443473406
Processing fold 2
Fold 2 RMSE: 279116.68370026455
Processing fold 3
Fold 3 RMSE: 901629.0875254313
Processing fold 4
Fold 4 RMSE: 374997.6858570821
Processing fold 5
Fold 5 RMSE: 278162.6350803256
Processing fold 6
Fold 6 RMSE: 1115793.910499437
Processing fold 7
Fold 7 RMSE: 237867.57149557606
Processing fold 8
Fold 8 RMSE: 334579.28443165525
Processing fold 9
Fold 9 RMSE: 901207.7796669098
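
Since the same loop is reused for each model in this notebook, it could be factored into a small helper. The sketch below is one possible shape, not the notebook's own code: `manual_cv` and its parameters are hypothetical names, and it relies on `DataFrame.filter` accepting a SQL expression string, so the helper itself needs no Spark imports.

```python
def manual_cv(estimator, df_folds, evaluator, k=10, fold_col="fold_id"):
    """Hypothetical helper: return one validation score per fold.

    Assumes `estimator.fit(df)` returns a model with `transform(df)`,
    and `evaluator.evaluate(df)` returns a float (as in pyspark.ml).
    """
    scores = []
    for fold in range(k):
        # Train on every fold except the current one
        train_df = df_folds.filter(f"{fold_col} != {fold}")
        # Validate on the held-out fold only
        val_df = df_folds.filter(f"{fold_col} = {fold}")

        model = estimator.fit(train_df)
        predictions = model.transform(val_df)
        scores.append(evaluator.evaluate(predictions))
    return scores
```

Under these assumptions, a call such as `manual_cv(lr, df_folds, evaluator)` with the `lr` estimator defined above would be expected to yield the same per-fold scores as the explicit loop.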


## Cross-Validation Performance Summary

The performance metrics obtained across all folds are aggregated to
estimate the model's generalization performance.

In [10]:
import numpy as np

rmse_mean = np.mean(rmse_scores)
rmse_std = np.std(rmse_scores)

print("Mean RMSE:", rmse_mean)
print("RMSE Standard Deviation:", rmse_std)

Mean RMSE: 491250.1305492349
RMSE Standard Deviation: 322926.2190084448
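
Note that `np.std` defaults to the population standard deviation (`ddof=0`). The sketch below, an addition using only the standard library and the ten RMSE values printed above, reproduces the summary and also shows the sample standard deviation (`ddof=1`), which is somewhat larger with only ten folds.

```python
import statistics

# Per-fold RMSE values copied from the linear regression output above
rmse_scores = [
    275504.41280093306, 213642.25443473406, 279116.68370026455,
    901629.0875254313, 374997.6858570821, 278162.6350803256,
    1115793.910499437, 237867.57149557606, 334579.28443165525,
    901207.7796669098,
]

mean_rmse = statistics.mean(rmse_scores)
pop_std = statistics.pstdev(rmse_scores)    # divides by n, like np.std's default
sample_std = statistics.stdev(rmse_scores)  # divides by n - 1, like np.std(..., ddof=1)

print("Mean RMSE:", mean_rmse)
print("Population std (ddof=0):", pop_std)
print("Sample std (ddof=1):", sample_std)
print("Std relative to mean:", pop_std / mean_rmse)
```

Which convention to report is a judgment call; the point is to state it, since the two differ noticeably at this sample size.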


## Manual Cross-Validation: Decision Tree Regressor

In [11]:
from pyspark.ml.regression import DecisionTreeRegressor

rmse_dt_scores = []

for fold in range(10):
    train_df = df_folds.filter(df_folds.fold_id != fold)
    val_df = df_folds.filter(df_folds.fold_id == fold)

    dt = DecisionTreeRegressor(
        featuresCol="scaled_features",
        labelCol="meter_sale_price",
        maxDepth=5,
        seed=42
    )

    model = dt.fit(train_df)
    predictions = model.transform(val_df)

    rmse = evaluator.evaluate(predictions)
    rmse_dt_scores.append(rmse)

In [12]:
import numpy as np

print("Decision Tree CV Mean RMSE:", np.mean(rmse_dt_scores))
print("Decision Tree CV RMSE Std:", np.std(rmse_dt_scores))

Decision Tree CV Mean RMSE: 97425.79882141565
Decision Tree CV RMSE Std: 38291.339881568354


## Manual Cross-Validation: Linear Regression

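Linear Regression is evaluated again with the same manual 10-fold
cross-validation procedure, this time storing the per-fold scores in
`rmse_lr_scores` so they can be exported alongside the Decision Tree
results. The scores match the earlier run because the fold assignment
is unchanged. This provides a consistent baseline for comparison with
non-linear models and ensemble methods.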

In [13]:
from pyspark.ml.regression import LinearRegression

rmse_lr_scores = []

for fold in range(10):
    print(f"Processing fold {fold}")

    train_df = df_folds.filter(df_folds.fold_id != fold)
    val_df = df_folds.filter(df_folds.fold_id == fold)

    lr = LinearRegression(
        featuresCol="scaled_features",
        labelCol="meter_sale_price"
    )

    model = lr.fit(train_df)
    predictions = model.transform(val_df)

    rmse = evaluator.evaluate(predictions)
    rmse_lr_scores.append(rmse)

    print(f"Fold {fold} RMSE: {rmse}")

Processing fold 0
Fold 0 RMSE: 275504.41280093306
Processing fold 1
Fold 1 RMSE: 213642.25443473406
Processing fold 2
Fold 2 RMSE: 279116.68370026455
Processing fold 3
Fold 3 RMSE: 901629.0875254313
Processing fold 4
Fold 4 RMSE: 374997.6858570821
Processing fold 5
Fold 5 RMSE: 278162.6350803256
Processing fold 6
Fold 6 RMSE: 1115793.910499437
Processing fold 7
Fold 7 RMSE: 237867.57149557606
Processing fold 8
Fold 8 RMSE: 334579.28443165525
Processing fold 9
Fold 9 RMSE: 901207.7796669098


In [14]:
import numpy as np

print("Linear Regression CV Mean RMSE:", np.mean(rmse_lr_scores))
print("Linear Regression CV RMSE Std:", np.std(rmse_lr_scores))

Linear Regression CV Mean RMSE: 491250.1305492349
Linear Regression CV RMSE Std: 322926.2190084448


## Manual Cross-Validation Summary

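Manual 10-fold cross-validation estimates both the average error of
each model and its variability across data partitions. Linear
Regression reaches a mean RMSE of about 491,250 with a standard
deviation of about 322,926; folds 3, 6, and 9 each exceed 900,000,
while the remaining folds stay below 375,000. The depth-5 Decision Tree
reaches a mean RMSE of about 97,426 with a standard deviation of about
38,291.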

These results serve as the primary evaluation benchmark and form the
basis for comparison with custom ensemble models developed in the
subsequent notebook.

In [15]:
import pandas as pd
from pathlib import Path

Path("data/outputs/metrics").mkdir(parents=True, exist_ok=True)

cv_df = pd.DataFrame({
    "fold": list(range(10)),
    "lr_rmse": rmse_lr_scores,
    "dt_rmse": rmse_dt_scores
})

cv_df.to_csv(
    "data/outputs/metrics/cv_results.csv",
    index=False
)