## TASK 1: Modelling Dataset and Unit of Analysis

Each row in the modelling dataset represents **one customer**, aggregated across
all their transactions in the Online Retail data.

This customer-level unit of analysis is appropriate for tree-based models because
customer behaviour can be represented through non-linear interactions between
spending, purchase frequency, and quantity purchased — relationships that decision
trees and ensembles can capture without requiring scaling.

One limitation is that aggregation removes temporal order (recency/sequence),
so customers with different purchase patterns may look similar once summarised.
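
One way to recover part of the temporal signal lost in aggregation is to keep a few date-based summaries per customer. The block below is a minimal sketch on made-up transactions, not part of the original pipeline. The names `RecencyDays` and `TenureDays` are hypothetical, the dates are assumed to be day-first as in the `InvoiceDate` preview, and recency is measured against the last date in the data, which is also an assumption.

```python
import pandas as pd

# Toy transactions in the same layout as the Online Retail data (values are made up).
tx = pd.DataFrame({
    "Customer ID": [1, 1, 2, 2, 3],
    "Invoice": ["A1", "A2", "B1", "B2", "C1"],
    "InvoiceDate": [
        "01/12/2009 07:45", "15/06/2010 10:00",
        "02/12/2009 09:30", "20/12/2009 14:10",
        "30/11/2010 16:20",
    ],
})

# Assumption: dates are day-first (dd/mm/yyyy hh:mm).
tx["InvoiceDate"] = pd.to_datetime(tx["InvoiceDate"], format="%d/%m/%Y %H:%M")

# Assumption: recency is measured relative to the latest date in the data.
snapshot = tx["InvoiceDate"].max()

# First and last purchase date per customer.
temporal = (
    tx.groupby("Customer ID")["InvoiceDate"]
      .agg(FirstPurchase="min", LastPurchase="max")
      .reset_index()
)

# Whole days since the last purchase, and whole days between first and last purchase.
temporal["RecencyDays"] = (snapshot - temporal["LastPurchase"]).dt.days
temporal["TenureDays"] = (temporal["LastPurchase"] - temporal["FirstPurchase"]).dt.days

print(temporal[["Customer ID", "RecencyDays", "TenureDays"]])
```

Two customers with the same totals but different `RecencyDays` would no longer look identical after summarisation.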


In [2]:
# Import the pandas library
import pandas as pd

# 1. Load the dataset
df = pd.read_csv('../data/online_retail.csv', encoding='ISO-8859-1')

# 2. Display the first few rows
print(df.head())

# 3. Display number of rows and columns
print(f"Dataset Shape: {df.shape}")

# 4. Display column names and basic data types
print(df.info())

  Invoice StockCode                          Description  Quantity  \
0  489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS        12   
1  489434    79323P                   PINK CHERRY LIGHTS        12   
2  489434    79323W                  WHITE CHERRY LIGHTS        12   
3  489434     22041         RECORD FRAME 7" SINGLE SIZE         48   
4  489434     21232       STRAWBERRY CERAMIC TRINKET BOX        24   

        InvoiceDate  Price  Customer ID         Country  
0  01/12/2009 07:45   6.95      13085.0  United Kingdom  
1  01/12/2009 07:45   6.75      13085.0  United Kingdom  
2  01/12/2009 07:45   6.75      13085.0  United Kingdom  
3  01/12/2009 07:45   2.10      13085.0  United Kingdom  
4  01/12/2009 07:45   1.25      13085.0  United Kingdom  
Dataset Shape: (525461, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525461 entries, 0 to 525460
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Inv

In [5]:
# First, we create a 'TotalAmount' column (Quantity * Price)
# so we can see the total money spent per row.
df['TotalAmount'] = df['Quantity'] * df['Price']

# Now we aggregate the data to the Customer Level.
# We want to know: How many trips (Invoices) did they make? 
# How many items (Quantity) did they buy? 
# And what is their total lifetime spend (TotalAmount)?

customer_level_df = df.groupby('Customer ID').agg({
    'Invoice': 'nunique',     # Count unique invoice numbers (Frequency of visits)
    'Quantity': 'sum',        # Total number of products bought
    'TotalAmount': 'sum'      # Total revenue generated by this customer
}).reset_index()

# Rename columns to be more descriptive for the new unit of analysis
customer_level_df.columns = ['CustomerID', 'TotalVisits', 'TotalItemsBought', 'LifetimeSpend']

print("--- Data aggregated to the CUSTOMER level ---")
print(f"Original rows: {len(df)}")
print(f"New rows (one per customer): {len(customer_level_df)}")
print("\nFirst few rows of the new Customer-Level dataframe:")
display(customer_level_df.head())

--- Data aggregated to the CUSTOMER level ---
Original rows: 525461
New rows (one per customer): 4383

First few rows of the new Customer-Level dataframe:


   CustomerID  TotalVisits  TotalItemsBought  LifetimeSpend
0     12346.0           15                52         -64.68
1     12347.0            2               828        1323.32
2     12348.0            1               373         222.16
3     12349.0            4               988        2646.99
4     12351.0            1               261         300.93


In [6]:
# Before counts
rows_before = len(df)

# Cleaning Step 1: Remove rows with missing Customer ID
df_clean = df.dropna(subset=['Customer ID']).copy()

# Cleaning Step 2: Keep only rows with positive Quantity and positive Price
# (drops returns with negative Quantity and zero-priced or otherwise invalid rows)
df_clean = df_clean[(df_clean['Quantity'] > 0) & (df_clean['Price'] > 0)]

# After counts
rows_after = len(df_clean)
removed = rows_before - rows_after

print(f"Total rows before: {rows_before}")
print(f"Total rows after:  {rows_after}")
print(f"Total rows removed: {removed} ({round(removed/rows_before*100, 2)}%)")

# EVIDENCE REQUIREMENT: Show that negative spenders are gone
# Re-aggregate spend (Quantity * Price) per customer for the audit check.
# Summing 'Price' alone would be positive by construction after the filter above.
row_spend = df_clean['Quantity'] * df_clean['Price']
check_customer = row_spend.groupby(df_clean['Customer ID']).sum()
print(f"Customers with negative spend remaining: {(check_customer < 0).sum()}")

Total rows before: 525461
Total rows after:  407664
Total rows removed: 117797 (22.42%)
Customers with negative spend remaining: 0


In [7]:
# TASK 1 CODE: Build a customer-level modelling dataset
# Assumes df_clean already exists after cleaning

# Ensure we continue with cleaned data only
df = df_clean.copy()  # use cleaned data from here onwards

# Aggregate transactional data to one row per customer
customer_df = (
    df.groupby("Customer ID")
      .agg(
          TotalSpend=("TotalAmount", "sum"),      # total money spent by customer
          NumTransactions=("Invoice", "nunique"), # number of unique invoices (purchase frequency)
          TotalQuantity=("Quantity", "sum")       # total items purchased (volume)
      )
      .reset_index()
)

# Quick sanity check
print("Customer-level dataset shape:", customer_df.shape)
customer_df.head()


Customer-level dataset shape: (4312, 4)


   Customer ID  TotalSpend  NumTransactions  TotalQuantity
0      12346.0      372.86               11             70
1      12347.0     1323.32                2            828
2      12348.0      222.16                1            373
3      12349.0     2671.14                3            993
4      12351.0      300.93                1            261


## TASK 2: Target Variable Definition

**Target:** `high_value_customer` (binary)

A customer is labelled as **high value (1)** if their total spending (`TotalSpend`)
is above the median spending across all customers; otherwise they are **0**.

This is a **classification problem** because the outcome is categorical (0/1).

Assumptions:
- The median is used to create a simple and reasonably balanced split.
- High-value customers are defined purely based on spending within the dataset period.

Risk / ambiguity:
- The median threshold is somewhat arbitrary; customers close to the threshold may
  switch class with small changes in spending.
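
The sensitivity of the median split can be checked directly. The sketch below uses made-up spend values in place of `customer_df["TotalSpend"]`; the 5% band is a hypothetical tolerance chosen only for illustration.

```python
import pandas as pd

# Made-up spend values; in the notebook this would be customer_df["TotalSpend"].
spend = pd.Series([120.0, 480.0, 495.0, 500.0, 505.0, 520.0, 900.0, 4000.0])

median_spend = spend.median()
labels = (spend > median_spend).astype(int)

# Hypothetical tolerance: customers whose spend is within 5% of the threshold.
band = 0.05
near_threshold = (spend - median_spend).abs() <= band * median_spend

# How many labels would flip if every customer had spent 5% more,
# while the threshold stayed where it is?
shifted_labels = (spend * (1 + band) > median_spend).astype(int)
flipped = int((labels != shifted_labels).sum())

print(f"Median threshold: {median_spend}")
print(f"Customers within {band:.0%} of the threshold: {int(near_threshold.sum())}")
print(f"Labels that flip after a {band:.0%} increase in spend: {flipped}")
```

A large share of customers inside the band would suggest that the labels near the median carry little real distinction.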


In [8]:
# TASK 2 CODE: Create classification target variable
# High-value customer = spending above median spending

median_spend = customer_df["TotalSpend"].median()  # threshold

customer_df["high_value_customer"] = (
    customer_df["TotalSpend"] > median_spend
).astype(int)

# Check target distribution (should be roughly balanced with median split)
print(customer_df["high_value_customer"].value_counts())
print(customer_df["high_value_customer"].value_counts(normalize=True))


high_value_customer
0    2156
1    2156
Name: count, dtype: int64
high_value_customer
0    0.5
1    0.5
Name: proportion, dtype: float64


## TASK 3: Feature Set and Explanations

We use the following features to model customer value:

1. **TotalSpend**  
   - Represents: total revenue contributed by the customer.  
   - Why useful: strongly indicates customer value and purchasing power.  
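   - Caveat: highly skewed; extreme spenders can dominate patterns. Because the
     target is derived directly from `TotalSpend`, using it as a feature leaks
     the label into the model.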

2. **NumTransactions**  
   - Represents: how many distinct purchases (invoices) the customer made.  
   - Why useful: captures engagement and repeat buying behaviour.  
   - Caveat: does not capture transaction size or timing.

3. **TotalQuantity**  
   - Represents: the total number of items the customer purchased.  
   - Why useful: indicates purchase volume, which can relate to value.  
   - Caveat: quantity does not account for item price differences.
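
The skew caveat for `TotalSpend` can be quantified with a quick check. This is a sketch on made-up values standing in for `customer_df["TotalSpend"]`; the log transform is shown only as a diagnostic, not as a step the notebook performs. Tree splits depend only on the ordering of a feature's values, so the skew mainly affects summaries and plots rather than the fitted trees.

```python
import numpy as np
import pandas as pd

# Made-up spend values with one extreme spender.
spend = pd.Series([50.0, 80.0, 120.0, 200.0, 300.0, 450.0, 25000.0])

raw_skew = spend.skew()            # sample skewness of the raw values
log_skew = np.log1p(spend).skew()  # skewness after a log(1 + x) transform

print(f"Mean:   {spend.mean():.2f}")
print(f"Median: {spend.median():.2f}")
print(f"Skewness (raw): {raw_skew:.2f}")
print(f"Skewness (log1p): {log_skew:.2f}")
```

A mean far above the median together with a large positive skewness indicates that a few customers account for most of the total spend.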


In [9]:
# TASK 3 CODE: Create feature matrix (X) and target (y)

features = ["TotalSpend", "NumTransactions", "TotalQuantity"]  # modelling features
X = customer_df[features]
y = customer_df["high_value_customer"]

# Preview features and target
X.head(), y.head()


(   TotalSpend  NumTransactions  TotalQuantity
 0      372.86               11             70
 1     1323.32                2            828
 2      222.16                1            373
 3     2671.14                3            993
 4      300.93                1            261,
 0    0
 1    1
 2    0
 3    1
 4    0
 Name: high_value_customer, dtype: int64)

## TASK 4: Train Tree-Based and Ensemble Models

We train and compare three models:
1. Decision Tree (baseline)
2. Random Forest (bagging ensemble to reduce variance)
3. Gradient Boosting (boosting ensemble to reduce bias and improve accuracy)

We use simple hyperparameters to focus on correct modelling workflow and interpretation.


In [10]:
# CODE: Split into train / validation / test sets
from sklearn.model_selection import train_test_split

# Use stratify to keep class balance in splits
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
)

print("Train:", X_train.shape, y_train.shape)
print("Val:  ", X_val.shape, y_val.shape)
print("Test: ", X_test.shape, y_test.shape)


Train: (3018, 3) (3018,)
Val:   (647, 3) (647,)
Test:  (647, 3) (647,)


In [11]:
# TASK 4 CODE: Fit Decision Tree, Random Forest, Gradient Boosting
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# 1) Decision Tree (baseline)
dt = DecisionTreeClassifier(max_depth=4, random_state=42)
dt.fit(X_train, y_train)

# 2) Random Forest (bagging ensemble)
rf = RandomForestClassifier(n_estimators=200, max_depth=6, random_state=42)
rf.fit(X_train, y_train)

# 3) Gradient Boosting (boosting ensemble)
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, y_train)


GradientBoostingClassifier(random_state=42)


In [12]:
# TASK 5 CODE: Evaluate models on train and validation sets
from sklearn.metrics import accuracy_score

models = {
    "Decision Tree": dt,
    "Random Forest": rf,
    "Gradient Boosting": gb
}

results = []

for name, model in models.items():
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    results.append((name, train_acc, val_acc))

# Print results neatly
for name, train_acc, val_acc in results:
    print(f"{name} | Train Acc: {train_acc:.4f} | Val Acc: {val_acc:.4f}")


Decision Tree | Train Acc: 1.0000 | Val Acc: 1.0000
Random Forest | Train Acc: 1.0000 | Val Acc: 1.0000
Gradient Boosting | Train Acc: 1.0000 | Val Acc: 1.0000


### TASK 5: Model Comparison (Train vs Validation)

All three models (Decision Tree, Random Forest, and Gradient Boosting)
achieve perfect accuracy on both the training and validation sets.

This is a consequence of how the target is constructed rather than evidence
of strong models. The target variable (`high_value_customer`) is a
deterministic function of `TotalSpend` (1 if above the median, 0 otherwise),
and `TotalSpend` is also included as a feature, so a single split on
`TotalSpend` near the median reproduces the labels.

Under these conditions, no performance difference is observed between a
single decision tree and the ensemble methods. Although Random Forest and
Gradient Boosting are generally more flexible and powerful, their advantages
cannot show up when one threshold already solves the task.

This result highlights a limitation of the modelling setup: when a feature
encodes the target (target leakage), train and validation accuracy say little
about how well customer value could be predicted from other behaviour, and
should be interpreted with caution.
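
The simplicity of the decision boundary can be made concrete with a depth-1 tree. The sketch below uses synthetic data, not the notebook's `customer_df`; the distributions and the `demo` frame are assumptions made purely for illustration. It fits a single-split tree and inspects which feature and threshold it picks.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic customers (assumed distributions, for illustration only).
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "TotalSpend": rng.lognormal(mean=6.0, sigma=1.0, size=n),
    "NumTransactions": rng.integers(1, 20, size=n),
    "TotalQuantity": rng.integers(1, 2000, size=n),
})

# Same target construction as in the notebook: spend above the median.
median_spend = demo["TotalSpend"].median()
demo["high_value_customer"] = (demo["TotalSpend"] > median_spend).astype(int)

X_demo = demo[["TotalSpend", "NumTransactions", "TotalQuantity"]]
y_demo = demo["high_value_customer"]

# A depth-1 tree can make exactly one split.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_demo, y_demo)

stump_acc = stump.score(X_demo, y_demo)
split_feature = X_demo.columns[stump.tree_.feature[0]]
split_threshold = stump.tree_.threshold[0]

print(f"Stump accuracy on its own data: {stump_acc:.4f}")
print(f"Root split: {split_feature} <= {split_threshold:.2f} (median = {median_spend:.2f})")
```

If one split on `TotalSpend` already separates the classes, deeper trees and ensembles have nothing left to improve.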


In [13]:
# TASK 6 CODE: Final evaluation on the held-out test set (use once)
for name, model in models.items():
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name} | Test Acc: {test_acc:.4f}")


Decision Tree | Test Acc: 1.0000
Random Forest | Test Acc: 1.0000
Gradient Boosting | Test Acc: 1.0000


### TASK 6: Final Test Set Evaluation

The test set is evaluated **only once**, after all modelling decisions
(feature selection, target definition, and model choices) have been finalised.
This ensures that the test set provides an unbiased estimate of performance on
previously unseen customers.

All three models achieve perfect accuracy on the test set, consistent with the
validation results observed earlier. There is no gap between training,
validation, and test accuracy, so there is no sign of overfitting.

However, these results should be interpreted with caution. The decision
boundary is trivially simple because the target is defined by a threshold on
`TotalSpend`, which is itself a feature. This alignment, rather than model
quality, explains the uniformly perfect test performance.

The models are indistinguishable in this setting, and no single model can be
considered superior based on test accuracy alone.
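
One way to obtain a comparison that can actually separate the models is to leave `TotalSpend` out of the feature set, so the label is no longer encoded in the inputs. The sketch below runs on synthetic data in which spend is loosely tied to quantity through a made-up unit price. The data-generating assumptions, the `behaviour_features` list, and the resulting scores are illustrative only and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic customers: spend = quantity * noisy unit price (assumption).
rng = np.random.default_rng(42)
n = 1500
quantity = rng.integers(1, 2000, size=n)
avg_price = rng.lognormal(mean=0.5, sigma=0.6, size=n)
demo = pd.DataFrame({
    "NumTransactions": rng.integers(1, 20, size=n),
    "TotalQuantity": quantity,
    "TotalSpend": quantity * avg_price,
})
demo["high_value_customer"] = (
    demo["TotalSpend"] > demo["TotalSpend"].median()
).astype(int)

# Hypothetical feature set that excludes the column the target is built from.
behaviour_features = ["NumTransactions", "TotalQuantity"]
X_tr, X_va, y_tr, y_va = train_test_split(
    demo[behaviour_features],
    demo["high_value_customer"],
    test_size=0.30,
    random_state=42,
    stratify=demo["high_value_customer"],
)

# Same model settings as in the notebook.
candidates = {
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, max_depth=6, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model and score it on the held-out split.
val_scores = {
    name: model.fit(X_tr, y_tr).score(X_va, y_va)
    for name, model in candidates.items()
}

for name, score in val_scores.items():
    print(f"{name} | Held-out Acc: {score:.4f}")
```

With the leaking feature removed, accuracy falls below 1.0 and differences between models, if any, become visible and worth interpreting.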

