# Custom Ensemble Learning from Scratch

This notebook implements a custom ensemble learning approach using
bagging (bootstrap aggregating) built explicitly from first principles.

Rather than relying on Spark’s built-in ensemble models, the ensemble
logic—including data resampling, model training, and prediction
aggregation—is implemented manually to ensure transparency and
methodological clarity.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DLD_Custom_Ensemble") \
    .getOrCreate()

In [2]:
df_ml = spark.read.parquet(
    "land_transaction_features.parquet"
)

## Train/Test Split for Ensemble Evaluation

A simple train/test split is used to compare ensemble performance against
single-model baselines. Robust evaluation is handled separately via
manual cross-validation.

In [3]:
train_df, test_df = df_ml.randomSplit([0.8, 0.2], seed=42)

## Base Learner Selection

Decision Tree Regressors are used as base learners due to their ability
to capture non-linear relationships and their sensitivity to training
data variation, which makes them well-suited for bagging.

In [4]:
from pyspark.ml.regression import DecisionTreeRegressor

def train_base_model(train_data, seed):
    return DecisionTreeRegressor(
        featuresCol="scaled_features",
        labelCol="meter_sale_price",
        maxDepth=5,
        seed=seed
    ).fit(train_data)

## Bootstrap Sampling

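Bootstrap sampling is performed by sampling the training data with
replacement. Each base learner is trained on a different bootstrap
sample to promote diversity within the ensemble. With `fraction=1.0`,
each sample is approximately, not exactly, the size of the training set,
because Spark does not guarantee an exact row count when sampling.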

In [5]:
def bootstrap_sample(df, seed):
    return df.sample(
        withReplacement=True,
        fraction=1.0,
        seed=seed
    )
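
To illustrate what a bootstrap sample looks like, the sketch below draws row indices with replacement in plain Python (not Spark) and measures how many distinct rows end up in the sample. The helper `bootstrap_indices` and the row count `n` are illustrative assumptions. Unlike Spark's `sample`, this sketch draws exactly `n` rows.

```python
import random

def bootstrap_indices(n, seed):
    """Draw n row indices uniformly with replacement (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]

n = 10_000  # assumed number of training rows, for illustration only
idx = bootstrap_indices(n, seed=42)

# Fraction of original rows that appear at least once in the sample
distinct_fraction = len(set(idx)) / n
print(f"rows drawn: {len(idx)}, distinct fraction: {distinct_fraction:.3f}")
```

For large `n` the distinct fraction approaches 1 - 1/e (about 0.632), so each base learner sees a noticeably different subset of rows, with some rows repeated.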

## Training the Ensemble

Multiple base models are trained independently on different bootstrap
samples of the training data.

In [6]:
num_models = 5
ensemble_models = []

for i in range(num_models):
    print(f"Training base model {i+1}")

    sample_df = bootstrap_sample(train_df, seed=42 + i)
    model = train_base_model(sample_df, seed=100 + i)

    ensemble_models.append(model)

Training base model 1
Training base model 2
Training base model 3
Training base model 4
Training base model 5


## Ensemble Prediction Aggregation

Predictions from all base learners are aggregated by averaging, producing
the final ensemble prediction.

In [7]:
from pyspark.sql.functions import expr

ensemble_pred_df = test_df

for i, model in enumerate(ensemble_models):
    ensemble_pred_df = (
        model
        .transform(ensemble_pred_df)
        .withColumnRenamed("prediction", f"pred_{i}")
    )

# Manually average predictions
avg_expr = " + ".join([f"pred_{i}" for i in range(num_models)])

ensemble_pred_df = ensemble_pred_df.withColumn(
    "ensemble_prediction",
    expr(f"({avg_expr}) / {num_models}")
)
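
The string passed to `expr` can be inspected without Spark. This sketch rebuilds it with the same column naming and checks the arithmetic on one hypothetical row; the values in `row` are made up for illustration.

```python
num_models = 5

# Same construction as in the aggregation cell
avg_expr = " + ".join(f"pred_{i}" for i in range(num_models))
sql_expr = f"({avg_expr}) / {num_models}"
print(sql_expr)  # (pred_0 + pred_1 + pred_2 + pred_3 + pred_4) / 5

# Hypothetical per-model predictions for a single row
row = {"pred_0": 100.0, "pred_1": 110.0, "pred_2": 90.0,
       "pred_3": 105.0, "pred_4": 95.0}

# Unweighted mean, which is what the SQL expression computes per row
ensemble_prediction = sum(row[f"pred_{i}"] for i in range(num_models)) / num_models
print(ensemble_prediction)  # 100.0
```

Every base learner gets equal weight here. A weighted average would need per-model weights, which this notebook does not define.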

## Ensemble Performance Evaluation

The ensemble model is evaluated using the same metric applied to
individual base learners to enable fair comparison.

In [8]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    labelCol="meter_sale_price",
    predictionCol="ensemble_prediction",
    metricName="rmse"
)

rmse_ensemble = evaluator.evaluate(ensemble_pred_df)

print("Ensemble RMSE:", rmse_ensemble)

Ensemble RMSE: 114973.98927169091


In [9]:
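# Baseline: one tree trained on the full training set (no bootstrap sample)
single_model = train_base_model(train_df, seed=999)
single_preds = single_model.transform(test_df)

# Rename the column so the evaluator configured for "ensemble_prediction" can be reused
rmse_single = evaluator.evaluate(
    single_preds.withColumnRenamed("prediction", "ensemble_prediction")
)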

print("Single Model RMSE:", rmse_single)

Single Model RMSE: 116741.20271981947
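
To put the two RMSE values side by side, the following sketch computes the absolute and relative reduction. The numbers are copied from the printed outputs above; the variable names `absolute_gain` and `relative_gain` are illustrative.

```python
# Values copied from the printed outputs of the two evaluation cells
rmse_ensemble = 114973.98927169091
rmse_single = 116741.20271981947

absolute_gain = rmse_single - rmse_ensemble
relative_gain = absolute_gain / rmse_single

print(f"Absolute RMSE reduction: {absolute_gain:.2f}")
print(f"Relative RMSE reduction: {relative_gain:.2%}")
```

The reduction is roughly 1.5% on this single split, so it is best read as a modest improvement rather than a decisive one.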


## Ensemble Learning Summary

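The custom bagging ensemble demonstrates how combining multiple
independently trained models can improve prediction robustness compared
to a single base learner. On this test split the gain is small: an
ensemble RMSE of about 114,974 versus about 116,741 for a single tree,
a reduction of roughly 1.5%.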

This ensemble is implemented entirely from scratch, including bootstrap
sampling and prediction aggregation, and serves as a practical
demonstration of ensemble learning principles in a large-scale Spark
environment.

## Linear Regression Bagging & Boosting

This section adds:
- a **baseline Linear Regression** model,
- **bagging** (bootstrap aggregation) using Linear Regression as the base learner,
- **boosting** (tree-based boosting in Spark + an optional Linear-Regression-based AdaBoost in scikit-learn for comparison).


In [10]:
# Baseline: Linear Regression (Spark ML)
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(
    featuresCol="scaled_features",
    labelCol="meter_sale_price",
    maxIter=100,
    regParam=0.0,
    elasticNetParam=0.0
)

lr_model = lr.fit(train_df)

lr_pred = lr_model.transform(test_df)
lr_pred.select("meter_sale_price", "prediction").show(5)


+----------------+------------------+
|meter_sale_price|        prediction|
+----------------+------------------+
|          110.96|34786.054928781385|
|          269.87|30251.387451888546|
|          342.69|-30302.29827854948|
|          477.36| 28271.38382512618|
|           542.9|-33993.96497656226|
+----------------+------------------+
only showing top 5 rows


In [11]:
# Evaluate Linear Regression

evaluator_rmse = RegressionEvaluator(
    labelCol="meter_sale_price", predictionCol="prediction", metricName="rmse"
)
evaluator_r2 = RegressionEvaluator(
    labelCol="meter_sale_price", predictionCol="prediction", metricName="r2"
)
evaluator_mae = RegressionEvaluator(
    labelCol="meter_sale_price", predictionCol="prediction", metricName="mae"
)

lr_rmse = evaluator_rmse.evaluate(lr_pred)
lr_r2 = evaluator_r2.evaluate(lr_pred)
lr_mae = evaluator_mae.evaluate(lr_pred)

print(f"Linear Regression -> RMSE: {lr_rmse:.4f}, MAE: {lr_mae:.4f}, R2: {lr_r2:.4f}")


Linear Regression -> RMSE: 1286960.9215, MAE: 63187.5461, R2: -16.3713
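
A negative R2 can look surprising, so here is a plain-Python sketch of the three metrics using their standard definitions, which are assumed to match what `RegressionEvaluator` reports. The labels and (rounded) predictions are the five rows shown in the output above; the function names are illustrative.

```python
import math

def rmse(labels, preds):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels))

def mae(labels, preds):
    return sum(abs(y - p) for y, p in zip(labels, preds)) / len(labels)

def r2(labels, preds):
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, preds))  # model error
    ss_tot = sum((y - mean_y) ** 2 for y in labels)            # error of predicting the mean
    return 1 - ss_res / ss_tot

# First five rows from the output above (predictions rounded to 2 decimals)
labels = [110.96, 269.87, 342.69, 477.36, 542.9]
preds = [34786.05, 30251.39, -30302.30, 28271.38, -33993.96]

print(f"RMSE: {rmse(labels, preds):.2f}")
print(f"MAE:  {mae(labels, preds):.2f}")
print(f"R2:   {r2(labels, preds):.2f}")
```

R2 drops below zero whenever the model's squared error exceeds that of always predicting the label mean, which is the case for these rows.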


## Bagging with Linear Regression (Bootstrap Aggregation)

We train multiple Linear Regression models on bootstrap samples of the training set and **average** their predictions.


In [12]:
from pyspark.ml.regression import LinearRegression

def train_lr_base_model(train_data):
    return LinearRegression(
        featuresCol="scaled_features",
        labelCol="meter_sale_price",
        maxIter=100,
        regParam=0.0,
        elasticNetParam=0.0
    ).fit(train_data)

num_lr_models = 10
lr_ensemble_models = []

for i in range(num_lr_models):
    print(f"Training LR bagging model {i+1}/{num_lr_models}")
    sample_df = bootstrap_sample(train_df, seed=900 + i)
    lr_ensemble_models.append(train_lr_base_model(sample_df))


Training LR bagging model 1/10
Training LR bagging model 2/10
Training LR bagging model 3/10
Training LR bagging model 4/10
Training LR bagging model 5/10
Training LR bagging model 6/10
Training LR bagging model 7/10
Training LR bagging model 8/10
Training LR bagging model 9/10
Training LR bagging model 10/10


In [13]:
# Aggregate bagged LR predictions by averaging
from pyspark.sql.functions import expr

lr_ensemble_pred_df = test_df

for i, model in enumerate(lr_ensemble_models):
    lr_ensemble_pred_df = (
        model
        .transform(lr_ensemble_pred_df)
        .withColumnRenamed("prediction", f"lr_pred_{i}")
    )

avg_expr = " + ".join([f"lr_pred_{i}" for i in range(num_lr_models)])

lr_ensemble_pred_df = lr_ensemble_pred_df.withColumn(
    "lr_bagging_prediction",
    expr(f"({avg_expr}) / {num_lr_models}")
)

lr_ensemble_pred_df.select("meter_sale_price", "lr_bagging_prediction").show(5)


+----------------+---------------------+
|meter_sale_price|lr_bagging_prediction|
+----------------+---------------------+
|          110.96| 1.669437254714057...|
|          269.87| 1.562890953849067...|
|          342.69| -1.27305689174912...|
|          477.36| 1.516368893773190...|
|           542.9| -1.11753466628267...|
+----------------+---------------------+
only showing top 5 rows


In [14]:
# Evaluate bagged Linear Regression
bag_rmse = RegressionEvaluator(
    labelCol="meter_sale_price", predictionCol="lr_bagging_prediction", metricName="rmse"
).evaluate(lr_ensemble_pred_df)

bag_r2 = RegressionEvaluator(
    labelCol="meter_sale_price", predictionCol="lr_bagging_prediction", metricName="r2"
).evaluate(lr_ensemble_pred_df)

bag_mae = RegressionEvaluator(
    labelCol="meter_sale_price", predictionCol="lr_bagging_prediction", metricName="mae"
).evaluate(lr_ensemble_pred_df)

print(f"Bagging (Linear Regression) -> RMSE: {bag_rmse:.4f}, MAE: {bag_mae:.4f}, R2: {bag_r2:.4f}")


Bagging (Linear Regression) -> RMSE: 61159678579050176512.0000, MAE: 1163866327838267904.0000, R2: -39231306255768049367254040576.0000
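
One property worth keeping in mind when bagging a linear base learner: the average of several linear predictions is itself a linear model whose coefficients are the mean of the individual coefficients. The sketch below checks this in plain Python. The coefficients in `models` and the feature vector `x` are hypothetical, not taken from the fitted Spark models.

```python
def predict(weights, bias, x):
    """Prediction of a linear model: bias + w . x"""
    return bias + sum(w * xi for w, xi in zip(weights, x))

# Hypothetical (weights, bias) pairs standing in for three bootstrap fits
models = [([2.0, -1.0], 0.5), ([1.5, -0.5], 1.0), ([2.5, -1.5], 0.0)]
x = [3.0, 4.0]  # hypothetical feature vector

# Route 1: average the individual predictions (what bagging does)
avg_of_predictions = sum(predict(w, b, x) for w, b in models) / len(models)

# Route 2: average the coefficients, then predict once
avg_weights = [sum(w[j] for w, _ in models) / len(models) for j in range(len(x))]
avg_bias = sum(b for _, b in models) / len(models)
prediction_of_avg = predict(avg_weights, avg_bias, x)

print(avg_of_predictions, prediction_of_avg)  # 2.5 2.5
```

Averaging therefore cannot add nonlinearity to a linear learner. Also, the RMSE of an averaged prediction cannot exceed the mean RMSE of the individual predictions, so the very large error reported above implies that at least one of the ten bootstrap fits has errors at that scale on its own.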


## Boosting

### Spark ML: Gradient-Boosted Trees Regressor
Spark’s built-in boosting for regression is implemented as Gradient-Boosted Trees.



In [15]:
# A) Boosting in Spark: Gradient-Boosted Trees Regressor
from pyspark.ml.regression import GBTRegressor

gbt = GBTRegressor(
    featuresCol="scaled_features",
    labelCol="meter_sale_price",
    maxDepth=5,
    maxIter=100,
    stepSize=0.1,
    seed=42
)

gbt_model = gbt.fit(train_df)
gbt_pred = gbt_model.transform(test_df)

gbt_rmse = evaluator_rmse.evaluate(gbt_pred)
gbt_r2 = evaluator_r2.evaluate(gbt_pred)
gbt_mae = evaluator_mae.evaluate(gbt_pred)

print(f"GBT Boosting (Spark) -> RMSE: {gbt_rmse:.4f}, MAE: {gbt_mae:.4f}, R2: {gbt_r2:.4f}")


GBT Boosting (Spark) -> RMSE: 116732.9913, MAE: 7318.3339, R2: 0.8571
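
For intuition, here is a minimal plain-Python sketch of gradient boosting for squared-error regression, where each round fits a depth-1 tree (a stump) to the current residuals. It is not Spark's `GBTRegressor` implementation. The helpers `fit_stump` and `boost` and the toy data are assumptions, and `step_size` only plays a role analogous to `stepSize`.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump on a 1-D feature by minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds=50, step_size=0.1):
    """Additive model: constant start, then shrunken stumps fitted to residuals."""
    base = sum(ys) / len(ys)
    stumps, preds = [], [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + step_size * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + step_size * sum(s(x) for s in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Toy nonlinear data: y = x^2
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]

model = boost(xs, ys)
mse_constant = mse(ys, [sum(ys) / len(ys)] * len(ys))
mse_boosted = mse(ys, [model(x) for x in xs])
print(f"MSE constant: {mse_constant:.4f}, MSE boosted: {mse_boosted:.4f}")
```

Unlike bagging, the rounds are sequential: each stump depends on the errors left by the previous ones, which is how boosting reduces bias.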


## Quick Comparison Table (Results)

After running the cells above, you can compare:
- Linear Regression (baseline)
- Bagging (Linear Regression)
- Boosting (GBT in Spark)

Typically:
- Bagging mainly reduces variance, so it helps unstable models most. Linear regression is already stable, so gains are usually modest.
- Boosting can reduce bias and capture nonlinear patterns, so GBT often improves performance when relationships are nonlinear.
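
The comparison can be laid out as a small table. The sketch below uses the metric values printed by the cells above, copied in as literals so that it runs on its own (`None` marks metrics that were not reported). The helper `fmt` and the model labels are illustrative.

```python
# (model, rmse, mae, r2) copied from the printed outputs; None = not reported
results = [
    ("Single Decision Tree", 116741.20271981947, None, None),
    ("Bagging (Decision Trees)", 114973.98927169091, None, None),
    ("Linear Regression", 1286960.9215, 63187.5461, -16.3713),
    ("Bagging (Linear Regression)", 61159678579050176512.0,
     1163866327838267904.0, -39231306255768049367254040576.0),
    ("GBT Boosting (Spark)", 116732.9913, 7318.3339, 0.8571),
]

def fmt(value):
    """Compact formatting; very large magnitudes fall back to scientific notation."""
    return "n/a" if value is None else f"{value:.4g}"

print(f"{'model':<30}{'rmse':>12}{'mae':>12}{'r2':>12}")
for name, rmse, mae, r2 in sorted(results, key=lambda r: r[1]):
    print(f"{name:<30}{fmt(rmse):>12}{fmt(mae):>12}{fmt(r2):>12}")

best_model = min(results, key=lambda r: r[1])[0]
print("Lowest RMSE:", best_model)
```

In this run the tree-based models have the lowest RMSE, while both linear models are far worse, so the typical behavior described above should be read against these numbers.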


In [18]:
import pandas as pd

ensemble_df = pd.DataFrame({
    "model": ["BaggingEnsemble"],
    "rmse": [rmse_ensemble]
})

ensemble_df.to_csv(
    "ensemble_metrics.csv",
    index=False
)