**Week 3: From EDA to Modelling (Trees & Ensembles)**

**Student ID: 259026010**

**Module: MA1133 Data Mining and Neural Networks**



**Task 1 — Re-establish Your Modelling Dataset**



**CODE**

In [None]:
import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv("online_retail_II.csv", encoding='latin-1')

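# Keep only purchases (remove returns); copy so the new column below
# is added to an independent frame rather than a view of the original
df = df[df["Quantity"] > 0].copy()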

# Create Revenue column
df["Revenue"] = df["Quantity"] * df["Price"]

# Create customer-level modelling dataset
model_df = df.groupby("Customer ID").agg(
    total_spending=("Revenue", "sum"),
    purchase_frequency=("Invoice", "nunique"),
    avg_basket_value=("Revenue", "mean")  # mean over rows, not over invoices
).reset_index()

# Remove missing values
model_df = model_df.dropna()


model_df.head()


|   | Customer ID | total_spending | purchase_frequency | avg_basket_value |
|---|---|---|---|---|
| 0 | 12346.0 | 230.55 | 10 | 16.467857 |
| 1 | 12349.0 | 1068.52 | 1 | 23.228696 |
| 2 | 12358.0 | 1429.83 | 1 | 84.107647 |
| 3 | 12359.0 | 1522.23 | 4 | 23.064091 |
| 4 | 12360.0 | 158.0 | 2 | 14.363636 |


In [None]:
model_df.shape

(2567, 4)

In [None]:
model_df.columns

Index(['Customer ID', 'total_spending', 'purchase_frequency',
       'avg_basket_value'],
      dtype='object')

**explain what one row represents at the modelling stage**

At the modelling stage, one row corresponds to a single customer. All of that customer's transaction-level records have been aggregated into a summary of their purchasing history in the dataset.


**explain why this unit of analysis makes sense for tree-based models**

Customer-level aggregation gives tree-based models structured, interpretable features such as total spending and purchase frequency. It also avoids treating the individual transactions of one customer as independent observations.

**mention one limitation this choice introduces.**

Aggregation discards short-term and invoice-level behavioural patterns, so shifts in a customer's behaviour over time are hidden.

**Task 2 — Define a Target Variable (Critical Thinking)**

**CODE**

In [None]:
# Create target variable: repeat purchase
model_df["repeat_purchase"] = (model_df["purchase_frequency"] > 1).astype(int)

# Preview the result
model_df[["Customer ID", "purchase_frequency", "repeat_purchase"]].head()


|   | Customer ID | purchase_frequency | repeat_purchase |
|---|---|---|---|
| 0 | 12346.0 | 10 | 1 |
| 1 | 12349.0 | 1 | 0 |
| 2 | 12358.0 | 1 | 0 |
| 3 | 12359.0 | 4 | 1 |
| 4 | 12360.0 | 2 | 1 |


**what the target represents**

The target variable indicates whether a customer has purchased from the retailer more than once.

**how it is constructed**

Customers with more than one distinct invoice are labelled repeat buyers (1).
Customers with exactly one invoice are labelled non-repeat buyers (0).
The target is therefore derived directly from purchase frequency.

**what assumptions it relies on.**

Multiple invoices reflect genuine repeat buying.
Every customer is observed over a comparable period of time.

**State whether the task is classification or regression**

Since the target variable is binary, this is a classification task.

**Explain one risk or ambiguity in your target definition**

Some customers may appear as non-repeat buyers only because they were observed for a short period, not because they genuinely lack repeat behaviour. This could lead to misclassification.
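One way to reduce this ambiguity is to give every customer the same follow-up window. The sketch below is illustrative only: it uses toy rows, an assumed `InvoiceDate` column, and a hypothetical cutoff date, none of which come from the analysis above. Features would be built from invoices before the cutoff, and the target records whether the customer bought again afterwards.

```python
import pandas as pd

# Toy transactions. "InvoiceDate" is an assumed column name and the
# rows are invented for illustration; they are not the retail data.
tx = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 3, 3],
    "Invoice": ["A1", "A2", "B1", "C1", "C2", "C3"],
    "InvoiceDate": pd.to_datetime([
        "2010-01-05", "2010-03-10", "2010-01-20",
        "2010-02-01", "2010-02-15", "2010-04-02",
    ]),
})

cutoff = pd.Timestamp("2010-03-01")  # hypothetical split date

# History window: used for features. Future window: used for the target.
history = tx[tx["InvoiceDate"] < cutoff]
future = tx[tx["InvoiceDate"] >= cutoff]

windowed = history.groupby("Customer ID").agg(
    purchase_frequency=("Invoice", "nunique")
).reset_index()

# Target: did the customer place any invoice on or after the cutoff?
windowed["repeat_purchase"] = (
    windowed["Customer ID"].isin(future["Customer ID"]).astype(int)
)

print(windowed)
```

With this construction the target is no longer a function of the frequency feature, because the two are computed from different time windows.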





**Task 3 — Feature Construction (With Restraint)**

In [None]:
# Feature matrix (X) and target variable (y)
X = model_df[
    ["total_spending", "purchase_frequency", "avg_basket_value"]
]

y = model_df["repeat_purchase"]


X.head(), y.head()


(   total_spending  purchase_frequency  avg_basket_value
 0          230.55                  10         16.467857
 1         1068.52                   1         23.228696
 2         1429.83                   1         84.107647
 3         1522.23                   4         23.064091
 4          158.00                   2         14.363636,
 0    1
 1    0
 2    0
 3    1
 4    1
 Name: repeat_purchase, dtype: int64)

**Feature 1: Total Spending**

**explain what they represent**

This feature is the total amount a customer has spent across all recorded purchases.

**explain why they may help prediction**

Customers who spend more in total are likely to have a stronger relationship with the retailer, which may increase the likelihood of repeat purchasing.

**note one limitation or caveat**

Total spending can be driven by one-off high-value purchases, so high spending does not always indicate regular engagement.

**Feature 2: Purchase Frequency**

**explain what they represent**

Purchase frequency counts the distinct invoices associated with a customer, so it reflects how often the customer makes purchases.

**explain why they may help prediction**

Frequent purchasing indicates repeated interaction with the retailer and can be a strong indicator of loyalty.

**note one limitation or caveat**

The target is defined directly from purchase frequency (a frequency greater than 1 gives a target of 1), so this feature determines the target exactly. It will dominate model decisions, and any accuracy obtained with it should be interpreted with caution.
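The effect of this caveat can be illustrated on synthetic data. The sketch below is not the retail dataset: the rows are randomly generated, spending is drawn independently of frequency, and the helper `holdout_accuracy` is a hypothetical name introduced here. It compares held-out accuracy with and without the frequency feature when the target is defined as frequency greater than 1.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: spending is independent of frequency,
# so only frequency can explain the target.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "total_spending": rng.gamma(2.0, 300.0, n),
    "purchase_frequency": rng.integers(1, 8, n),  # integer values 1..7
})
toy["repeat_purchase"] = (toy["purchase_frequency"] > 1).astype(int)

train, test = train_test_split(toy, test_size=0.25, random_state=42)


def holdout_accuracy(features):
    """Fit a shallow tree on `features` and score it on the held-out rows."""
    tree = DecisionTreeClassifier(max_depth=3, random_state=42)
    tree.fit(train[features], train["repeat_purchase"])
    return tree.score(test[features], test["repeat_purchase"])


acc_with_frequency = holdout_accuracy(["total_spending", "purchase_frequency"])
acc_without_frequency = holdout_accuracy(["total_spending"])

print(acc_with_frequency, acc_without_frequency)
```

When the frequency feature is present, a single threshold between 1 and 2 reproduces the target, so perfect accuracy says little about how well the model would predict future behaviour.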

**Feature 3: Average Basket Value**

**explain what they represent**

This feature is the mean revenue per transaction row for each customer, as computed in Task 1. Because a single invoice can span several rows, it approximates the average value per invoice rather than equalling it.

**explain why they may help prediction**

Average basket value may help separate customers who make many low-value purchases from those who purchase less often but spend more per transaction.

**note one limitation or caveat**

As an average, the feature hides variation between transactions and ignores how purchasing behaviour changes over time.
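The Task 1 aggregation takes the mean of Revenue over rows, so when an invoice spans several rows each row counts separately. A per-invoice average needs two steps: sum revenue within each invoice, then average those totals per customer. The sketch below uses invented toy rows to show the difference; the variable names are illustrative.

```python
import pandas as pd

# Toy rows: customer 1 has two invoices, the first with two rows.
lines = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "Invoice": ["A1", "A1", "A2", "B1"],
    "Revenue": [10.0, 30.0, 20.0, 50.0],
})

# Mean over rows: what ("Revenue", "mean") computes per customer.
per_row = lines.groupby("Customer ID")["Revenue"].mean()

# Mean over invoices: total per invoice first, then average per customer.
invoice_totals = lines.groupby(["Customer ID", "Invoice"])["Revenue"].sum()
per_invoice = invoice_totals.groupby(level="Customer ID").mean()

print(per_row.loc[1])      # 20.0  (60 / 3 rows)
print(per_invoice.loc[1])  # 30.0  ((40 + 20) / 2 invoices)
```

The two versions agree only for customers whose invoices each contain a single row, as for customer 2 in the toy data.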

**Task 4 — Train Tree-Based Models**

**CODE**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score


In [None]:
# Train/test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


**MODEL 1 - DECISION TREE CLASSIFIER**

In [None]:
dt = DecisionTreeClassifier(
    max_depth=5,
    random_state=42
)

dt.fit(X_train, y_train)

dt_train_acc = accuracy_score(y_train, dt.predict(X_train))
dt_test_acc = accuracy_score(y_test, dt.predict(X_test))

dt_train_acc, dt_test_acc


(1.0, 1.0)

**MODEL 2 - RANDOM FOREST CLASSIFIER**

In [None]:
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    random_state=42
)

rf.fit(X_train, y_train)

rf_train_acc = accuracy_score(y_train, rf.predict(X_train))
rf_test_acc = accuracy_score(y_test, rf.predict(X_test))

rf_train_acc, rf_test_acc


(1.0, 1.0)

**MODEL 3 - GRADIENT BOOSTED TREE**

In [None]:
gb = GradientBoostingClassifier(
    n_estimators=100,
    max_depth=3,
    random_state=42
)

gb.fit(X_train, y_train)

gb_train_acc = accuracy_score(y_train, gb.predict(X_train))
gb_test_acc = accuracy_score(y_test, gb.predict(X_test))

gb_train_acc, gb_test_acc


(1.0, 1.0)

**Do not aggressively tune hyperparameters**



We intentionally avoided aggressive hyperparameter tuning. To keep the models straightforward and easy to interpret, we mostly used default values and made only a few changes, such as limiting tree depth, to reduce overfitting.

**Basic Model Preferences**

We chose shallow trees for all models so that the decision process stays simple and easy to understand. Rather than pursuing the highest predictive accuracy, our goal was to make it easier to analyse how the models behave.
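A shallow tree can be read directly as a set of rules, which is one practical benefit of limiting depth. The sketch below is a minimal illustration on invented toy rows, with the target built the same way as in Task 2; it prints the learned rules with scikit-learn's `export_text`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy customers (invented values), target defined as frequency > 1.
toy_X = pd.DataFrame({
    "total_spending": [120.0, 900.0, 300.0, 1500.0, 80.0, 640.0],
    "purchase_frequency": [1, 1, 3, 4, 2, 1],
})
toy_y = (toy_X["purchase_frequency"] > 1).astype(int)

tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(toy_X, toy_y)

# Text rendering of the fitted tree, one line per node.
rules = export_text(tree, feature_names=list(toy_X.columns))
print(rules)
print("depth used:", tree.get_depth())
```

Although up to five levels are allowed, the toy tree stops after one split because a single threshold on purchase frequency already separates the classes.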



**Task 5 — Validation-Based Comparison**

**CODE**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# First split: Train + Temp (Validation + Test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Second split: Validation + Test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)


**Evaluate models on Train and Validation sets**

In [None]:
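# DECISION TREE
# Refit on the new training split: the Task 4 model was trained on a
# different split, so it may already have seen some validation rows
dt.fit(X_train, y_train)
dt_train_acc = accuracy_score(y_train, dt.predict(X_train))
dt_val_acc = accuracy_score(y_val, dt.predict(X_val))

dt_train_acc, dt_val_acc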


(1.0, 1.0)

In [None]:
# RANDOM FOREST
# Refit on the new training split so the validation rows are unseen
rf.fit(X_train, y_train)
rf_train_acc = accuracy_score(y_train, rf.predict(X_train))
rf_val_acc = accuracy_score(y_val, rf.predict(X_val))

rf_train_acc, rf_val_acc


(1.0, 1.0)

In [None]:
# GRADIENT BOOSTED TREE
# Refit on the new training split so the validation rows are unseen
gb.fit(X_train, y_train)
gb_train_acc = accuracy_score(y_train, gb.predict(X_train))
gb_val_acc = accuracy_score(y_val, gb.predict(X_val))

gb_train_acc, gb_val_acc


(1.0, 1.0)

**report training vs validation (if performed) performance**

The decision tree, random forest and gradient boosted tree each reach an accuracy of 1.0 on both the training and validation sets. The shallow decision tree captures the main decision structure without visible overfitting. The random forest is equally consistent across the two sets, as expected when predictions are averaged over many trees. The gradient boosted tree behaves the same way, learning the pattern even with limited depth.

**comment on the gap between them**

The gap between training and validation performance is zero for all three models. The aggregated customer-level features separate the target classes perfectly, which is expected because the target is defined from purchase frequency and purchase frequency is also a feature. The absence of a gap therefore reflects how the target was constructed and the simplicity of the dataset rather than strong generalisation.
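Accuracy figures are easier to interpret next to a reference point. One common choice, sketched below on invented labels, is a majority-class baseline from scikit-learn's `DummyClassifier`: it shows the accuracy obtainable without using any features. The label counts here are illustrative and are not the class balance of the retail data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels: 7 repeat buyers and 3 non-repeat buyers.
y_toy = np.array([1] * 7 + [0] * 3)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the baseline

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)  # 0.7, the share of the majority class
```

A model is only informative to the extent that it beats this baseline for reasons other than leakage from the target definition.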

**relate observations to model complexity**

The decision tree is the simplest model and provides a clear behavioural baseline. The random forest adds complexity through bagging, and the gradient boosted tree adds more through sequential learning, but neither improves on the decision tree. This indicates diminishing returns from added complexity on this dataset.

**Task 6 — Final Test-Set Check**

**CODE**

In [None]:
from sklearn.metrics import accuracy_score

# Decision Tree - Test performance
dt_test_acc = accuracy_score(y_test, dt.predict(X_test))

# Random Forest - Test performance
rf_test_acc = accuracy_score(y_test, rf.predict(X_test))

# Gradient Boosting - Test performance
gb_test_acc = accuracy_score(y_test, gb.predict(X_test))

dt_test_acc, rf_test_acc, gb_test_acc


(1.0, 1.0, 1.0)

**explain why the test set is used only once**

The test set is used only once so that it gives an unbiased estimate of model performance on unseen data. Reusing it would leak information into modelling decisions and produce overly optimistic conclusions. All modelling and comparison decisions were therefore finalised before evaluating on the test set.
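The workflow described here can be sketched end to end on synthetic data. The example below is an illustration under stated assumptions, not the notebook's exact procedure: the data are randomly generated, the splits are stratified (an addition not used above), and the names `candidates` and `chosen` are hypothetical. Models are fitted on the training split, compared on the validation split, and the test split is touched once at the end.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a noisy relationship between feature 0 and the label.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

# 70% train, then the remaining 30% split evenly into validation and test.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

candidates = {
    "tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "forest": RandomForestClassifier(
        n_estimators=100, max_depth=5, random_state=42
    ),
}

# All comparisons use the validation split only.
val_scores = {
    name: model.fit(X_tr, y_tr).score(X_va, y_va)
    for name, model in candidates.items()
}
chosen = max(val_scores, key=val_scores.get)

# The test split is evaluated once, after every decision has been made.
test_acc = candidates[chosen].score(X_te, y_te)
print(val_scores, chosen, test_acc)
```

Keeping the test evaluation as the final line makes it harder to feed test results back into model choices by accident.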

**state whether test behaviour aligns with validation observations**

Test-set performance aligns with the validation observations: all three models score 1.0 on the test set, as they did on the validation set. The models generalise in the same way, which is consistent with the simplicity of the feature set.

**avoid interpreting small numerical differences**

The three test accuracies are identical here, so there are no differences to interpret. More generally, small numerical differences between test accuracies should not be read as meaningful, because with a structured, aggregated dataset of this kind they can arise from the data split rather than from substantive model differences. The focus remains on behavioural consistency rather than on selecting a best-performing model.