# Custom Ensemble Learning from Scratch

This notebook implements a custom ensemble learning approach using
bagging (bootstrap aggregating) built explicitly from first principles.

Rather than relying on Spark’s built-in ensemble models, the ensemble
logic—including data resampling, model training, and prediction
aggregation—is implemented manually to ensure transparency and
methodological clarity.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DLD_Custom_Ensemble") \
    .getOrCreate()

In [5]:
df_ml = spark.read.parquet(
    "land_transactions_features.parquet"
)

## Train/Test Split for Ensemble Evaluation

A single seeded 80/20 train/test split is used to compare ensemble
performance against a single-model baseline. Robust evaluation is handled
separately via manual cross-validation.

In [7]:
train_df, test_df = df_ml.randomSplit([0.8, 0.2], seed=42)

## Base Learner Selection

Decision tree regressors are used as base learners because they capture
non-linear relationships and are high-variance: small changes in the
training data can produce noticeably different trees. Averaging over such
unstable learners is where bagging helps most.
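
The benefit of averaging unstable learners can be made concrete with a standard identity. The symbols below are introduced here for illustration and do not appear in the notebook: assume $M$ base learners whose predictions at a fixed input each have variance $\sigma^2$ and pairwise correlation $\rho$.

```latex
\operatorname{Var}\!\left(\frac{1}{M}\sum_{m=1}^{M} f_m(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{M}\,\sigma^2
```

Under these assumptions, the second term shrinks as $M$ grows, while the first term remains. Bootstrap resampling aims to lower $\rho$ by giving each tree a different view of the data.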

In [8]:
from pyspark.ml.regression import DecisionTreeRegressor

def train_base_model(train_data, seed):
    return DecisionTreeRegressor(
        featuresCol="scaled_features",
        labelCol="meter_sale_price",
        maxDepth=5,
        seed=seed
    ).fit(train_data)

## Bootstrap Sampling

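Bootstrap sampling draws rows from the training data with replacement.
With `fraction=1.0`, each sample is roughly the size of the training set,
although Spark does not guarantee an exact row count. Each base learner is
trained on a different bootstrap sample (a different seed) to promote
diversity within the ensemble.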

In [9]:
def bootstrap_sample(df, seed):
    return df.sample(
        withReplacement=True,
        fraction=1.0,
        seed=seed
    )
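
To illustrate what one bootstrap sample looks like, here is a minimal plain-Python sketch (not Spark). The helper `bootstrap_indices` and the row count are hypothetical. Unlike `DataFrame.sample`, this version draws exactly as many rows as the original set.

```python
import random

def bootstrap_indices(n_rows, seed):
    """Draw n_rows row indices uniformly at random, with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(n_rows) for _ in range(n_rows)]

n_rows = 10_000  # hypothetical training-set size
sample = bootstrap_indices(n_rows, seed=42)

# Some rows are drawn several times, others not at all.
unique_fraction = len(set(sample)) / n_rows

print(f"rows drawn:    {len(sample)}")
print(f"distinct rows: {unique_fraction:.1%}")  # expected near 1 - 1/e (about 63%)
```

Roughly a third of the rows are left out of any given sample, which is what makes the base learners differ from one another.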

## Training the Ensemble

Multiple base models are trained independently on different bootstrap
samples of the training data.

In [20]:
num_models = 5
ensemble_models = []

for i in range(num_models):
    print(f"Training base model {i+1}")

    sample_df = bootstrap_sample(train_df, seed=42 + i)
    model = train_base_model(sample_df, seed=100 + i)

    ensemble_models.append(model)

Training base model 1
Training base model 2
Training base model 3
Training base model 4
Training base model 5


## Ensemble Prediction Aggregation

Predictions from all base learners are aggregated by averaging, producing
the final ensemble prediction.

In [27]:
from pyspark.sql.functions import expr

ensemble_pred_df = test_df

for i, model in enumerate(ensemble_models):
    ensemble_pred_df = (
        model
        .transform(ensemble_pred_df)
        .withColumnRenamed("prediction", f"pred_{i}")
    )

# Manually average predictions: sum the per-model columns, then divide
sum_expr = " + ".join(f"pred_{i}" for i in range(num_models))

ensemble_pred_df = ensemble_pred_df.withColumn(
    "ensemble_prediction",
    expr(f"({sum_expr}) / {num_models}")
)
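
The row-wise averaging performed by the Spark expression can be sketched in plain Python. The function name `average_predictions` and the prediction values are made up for illustration.

```python
def average_predictions(per_model_preds):
    """Average predictions row by row across models.

    per_model_preds: list of equal-length lists, one list per base model.
    """
    num_models = len(per_model_preds)
    # zip(*...) groups the predictions that belong to the same test row.
    return [sum(row) / num_models for row in zip(*per_model_preds)]

# Hypothetical predictions from three base models on four test rows.
preds = [
    [100.0, 200.0, 300.0, 400.0],
    [110.0, 190.0, 310.0, 390.0],
    [90.0, 210.0, 290.0, 410.0],
]

ensemble = average_predictions(preds)
print(ensemble)  # [100.0, 200.0, 300.0, 400.0]
```

In this toy input the individual models err in opposite directions, so their errors cancel in the average. Real base learners are correlated and cancel only partially.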

## Ensemble Performance Evaluation

The ensemble is evaluated with RMSE on the held-out test set. The same
metric and the same evaluator are then applied to a single-tree baseline to
enable a fair comparison.

In [28]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    labelCol="meter_sale_price",
    predictionCol="ensemble_prediction",
    metricName="rmse"
)

rmse_ensemble = evaluator.evaluate(ensemble_pred_df)

print("Ensemble RMSE:", rmse_ensemble)

Ensemble RMSE: 114973.98927169091
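
For reference, RMSE is the square root of the mean squared difference between labels and predictions, so it is expressed in the same units as the label. A minimal plain-Python sketch with made-up values follows. The `rmse` helper is illustrative and is not the `RegressionEvaluator` implementation.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical labels and predictions: errors are -10, +10, -20.
labels = [100.0, 200.0, 300.0]
predictions = [110.0, 190.0, 320.0]

value = rmse(labels, predictions)
print(value)  # sqrt((100 + 100 + 400) / 3) = sqrt(200), about 14.14
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.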


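For comparison, a single decision tree with the same hyperparameters is
trained on the full (non-resampled) training split and scored with the same
evaluator.

In [29]:
single_model = train_base_model(train_df, seed=999)
single_preds = single_model.transform(test_df)

# The evaluator reads "ensemble_prediction", so rename the column to reuse it
rmse_single = evaluator.evaluate(
    single_preds.withColumnRenamed("prediction", "ensemble_prediction")
)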

print("Single Model RMSE:", rmse_single)

Single Model RMSE: 116741.20271981947
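
The size of the gap between the two reported RMSE values can be worked out directly. The snippet below copies the two numbers from the cell outputs above; the variable names `absolute_gain` and `relative_gain` are introduced here.

```python
# Values copied from the notebook outputs.
rmse_ensemble = 114973.98927169091
rmse_single = 116741.20271981947

absolute_gain = rmse_single - rmse_ensemble
relative_gain = absolute_gain / rmse_single

print(f"Absolute RMSE reduction: {absolute_gain:.2f}")
print(f"Relative RMSE reduction: {relative_gain:.2%}")
```

This works out to a reduction of roughly 1.5% on this particular split.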


## Ensemble Learning Summary

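On this test split, the custom bagging ensemble of five depth-5 trees
reached an RMSE of about 114,974, compared with about 116,741 for a single
tree trained on the full training set, a reduction of roughly 1.5%. The
comparison rests on one split with fixed seeds, so it illustrates how
averaging independently trained models can stabilize predictions rather
than establishing the size of the gain.

The ensemble logic (bootstrap sampling, the training loop, and prediction
aggregation) is implemented manually on top of Spark's
`DecisionTreeRegressor`, and serves as a practical demonstration of
ensemble learning principles in a large-scale Spark environment.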