Problem Statement (Python Predictive Analytics Project)

The e-commerce business needs a reliable way to identify which existing customers are likely to become high-value in the future, so that marketing and retention efforts can be focused on the right people instead of treating all customers the same. Currently, segmentation relies largely on simple historical rules (e.g. looking at past spend only). These rules do not fully capture patterns of purchase frequency, basket size and recency, and they cannot predict future value with confidence.

This project uses Python-based predictive analytics on historical transaction data (customer and purchase records) to build a model that predicts the likelihood that a customer will become a future high-value customer, based on their past purchasing behaviour. The output can be used to rank customers by predicted future value and support more targeted, data-driven marketing and retention strategies.

Research Objectives (5)

To define and construct a future high-value customer label
Develop an operational definition of “high-value” based on future total spend beyond a chosen cut-off date (e.g. top 25% of future spenders), and create this label at the customer level.

To engineer behavioural features from historical transactions
Aggregate past purchase data into customer-level features such as past total spend, number of orders, average order value and recency (days since last purchase), which can be used as inputs to the predictive model.

To build and evaluate a predictive classification model in Python
Use logistic regression (and potentially other models) to predict whether a customer will be a future high-value customer, and assess performance using metrics such as accuracy, precision, recall and F1-score on a held-out test set.

To interpret the drivers of future high-value behaviour
Analyse model coefficients and feature importance to understand how past spend, order frequency, basket size and recency are related to the probability of becoming a future high-value customer.

To translate model outputs into actionable customer targeting
Generate predicted probabilities for all customers, rank them by likelihood of becoming high-value, and propose how the business could use the top-ranked segments (e.g. top 20–30%) for targeted marketing or retention campaigns, while acknowledging model limitations and potential improvements.

In [None]:
# Question: “Based on how a customer has behaved so far, can we predict whether they are a high-value customer?”
#
# We’ll tag customers as:
#   1 = High-value (top spenders)
#   0 = Not high-value (everyone else)
#
# Then we train a model to predict that tag using:
#   - How much they spent
#   - How many orders they placed
#   - How recently they bought
#   - Etc.
#
# This is a binary classification problem (yes/no).

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix


In [None]:
# pandas = Excel-style tables in code form.
# numpy = maths engine.
# train_test_split = splits data into training and testing pieces.
# LogisticRegression = simple, classic yes/no prediction model.
# classification_report / confusion_matrix = show how good our model is.

In [3]:
customers = pd.read_csv("customer.csv")
purchases = pd.read_csv("purchase.csv")

print(customers.head())
print(purchases.head())


  customer_id     first_name last_name  gender  age country  income
0     CS00001        Isadora     Porto  Female   19  Brazil  117196
1     CS00002           Hugo   Carreño    Male   33   Chile   49256
2     CS00003           René   Olivera    Male   65  Mexico   33434
3     CS00004  Luiz Henrique     Pinto    Male   55  Brazil   75302
4     CS00005       Leonardo  Monteiro    Male   19  Brazil   32280
     order_id customer_id product_name  \
0  ODSHP00001     CS00001    Furniture   
1  ODSHP00002     CS00002        Dress   
2  ODSHP00003     CS00003    Furniture   
3  ODSHP00004     CS00004        Shoes   
4  ODSHP00005     CS00005         Rugs   

                                         description    price  discount   tax  \
0  Transform your space with this stylish and fun...   645.52      0.37  0.02   
1   Look and feel your best with this elegant dress.    28.90      0.05  0.02   
2  Transform your space with this stylish and fun...  3536.49      0.21  0.04   
3    Step out i

In [9]:
# Create a purchase_amount column: price x quantity, net of the fractional discount (tax is not included)
purchases["purchase_amount"] = purchases["price"] * purchases["quantity"] * (1 - purchases["discount"])


In [19]:
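# Parse order dates as day-first; errors="coerce" turns unparseable values into NaT instead of raising
purchases["order_date"] = pd.to_datetime(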
    purchases["order_date"],
    dayfirst=True,
    errors="coerce"
)


In [21]:
cust_agg = (
    purchases
    .groupby("customer_id")
    .agg(
        total_spend=("purchase_amount", "sum"),        # total money spent after discount
        num_orders=("order_id", "nunique"),            # how many orders
        avg_order_value=("purchase_amount", "mean"),   # mean per purchase row (average order value if one row per order)
        first_purchase_date=("order_date", "min"),
        last_purchase_date=("order_date", "max")
    )
    .reset_index()
)

In [23]:
threshold = cust_agg["total_spend"].quantile(0.75)
cust_agg["high_value"] = (cust_agg["total_spend"] >= threshold).astype(int)


In [25]:
print(cust_agg["last_purchase_date"].head())
print(cust_agg["last_purchase_date"].dtype)


0   2022-11-14
1   2022-12-29
2   2022-11-23
3   2022-12-14
4   2023-01-01
Name: last_purchase_date, dtype: datetime64[ns]
datetime64[ns]


In [27]:
max_date = cust_agg["last_purchase_date"].max()
data = cust_agg.copy()
data["recency_days"] = (max_date - data["last_purchase_date"]).dt.days


In [31]:
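# total_spend is excluded because high_value is a threshold on it. The label is still
# computed from the same period as these features, so very high scores may reflect leakage.
feature_cols = ["num_orders", "avg_order_value", "recency_days"]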

X = data[feature_cols]
y = data["high_value"]

X = X.fillna(0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


[[225   0]
 [  0  75]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       225
           1       1.00      1.00      1.00        75

    accuracy                           1.00       300
   macro avg       1.00      1.00      1.00       300
weighted avg       1.00      1.00      1.00       300
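
The perfect scores above are worth a sanity check before they are trusted. The label `high_value` is a threshold on `total_spend`, and the remaining features may still reconstruct that quantity. The following is a minimal sketch on made-up data (the toy table and its values are illustrative, not taken from the project files), assuming one row per order:

```python
import numpy as np
import pandas as pd

# Toy purchases table (hypothetical values), one row per order
toy = pd.DataFrame({
    "customer_id": ["A", "A", "B", "C", "C", "C"],
    "order_id": ["o1", "o2", "o3", "o4", "o5", "o6"],
    "purchase_amount": [10.0, 30.0, 200.0, 5.0, 5.0, 20.0],
})

# Same aggregation pattern as the notebook's cust_agg
agg = toy.groupby("customer_id").agg(
    total_spend=("purchase_amount", "sum"),
    num_orders=("order_id", "nunique"),
    avg_order_value=("purchase_amount", "mean"),
).reset_index()

# If this holds, the two features jointly encode the quantity used to build the label
reconstructed = agg["num_orders"] * agg["avg_order_value"]
leaks = bool(np.allclose(reconstructed, agg["total_spend"]))
print(leaks)
```

If the same check passes on the real `cust_agg`, the first model is mostly recovering the label definition, which is one motivation for the past/future split used in the later cells.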



In [33]:
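# Re-parsing is harmless if order_date is already datetime; errors="coerce" matches the earlier cell
purchases["order_date"] = pd.to_datetime(purchases["order_date"], dayfirst=True, errors="coerce")

# Cut-off separating the feature window (past) from the label window (future)
cutoff_date = pd.to_datetime("2022-07-01")

# Rows with a missing (NaT) order_date match neither condition and are dropped from both sets
past = purchases[purchases["order_date"] <= cutoff_date]
future = purchases[purchases["order_date"] > cutoff_date]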


In [35]:
past_agg = (
    past
    .groupby("customer_id")
    .agg(
        past_total_spend=("purchase_amount", "sum"),
        past_num_orders=("order_id", "nunique"),
        past_avg_order_value=("purchase_amount", "mean"),
        last_past_purchase=("order_date", "max")
    )
    .reset_index()
)

# Recency at cutoff (how long before cutoff they last purchased)
past_agg["recency_days"] = (cutoff_date - past_agg["last_past_purchase"]).dt.days


In [37]:
future_agg = (
    future
    .groupby("customer_id")
    .agg(
        future_total_spend=("purchase_amount", "sum")
    )
    .reset_index()
)


In [39]:
# Left merge keeps only customers with at least one purchase on or before the cut-off
data_pf = past_agg.merge(future_agg, on="customer_id", how="left")

# Customers with no future spend will have NaN -> treat as 0
data_pf["future_total_spend"] = data_pf["future_total_spend"].fillna(0)


In [41]:
threshold_future = data_pf["future_total_spend"].quantile(0.75)
data_pf["high_value_future"] = (data_pf["future_total_spend"] >= threshold_future).astype(int)
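
One edge case may be worth guarding against here: if more than 75% of customers have no purchases after the cut-off, the 75th percentile of `future_total_spend` is 0 and the `>=` comparison labels every customer as high-value. A small sketch of such a guard on made-up numbers (the helper name `label_top_quartile` is hypothetical, not part of the notebook):

```python
import pandas as pd

def label_top_quartile(spend: pd.Series, q: float = 0.75) -> pd.Series:
    """Label customers at or above the q-th quantile, requiring positive spend."""
    threshold = spend.quantile(q)
    # Requiring spend > 0 prevents a zero threshold from labelling everyone as 1
    return ((spend >= threshold) & (spend > 0)).astype(int)

# Toy example: 8 of 10 customers have no future spend, so the 75th percentile is 0
toy_spend = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 120.0, 300.0])
naive = (toy_spend >= toy_spend.quantile(0.75)).astype(int)
guarded = label_top_quartile(toy_spend)

print(naive.sum(), guarded.sum())
```

In the notebook's own data the class split in the test set is 225 to 75, so the threshold there appears to be positive; the guard only matters if the cut-off date or dataset changes.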


In [43]:
feature_cols = ["past_num_orders", "past_avg_order_value", "past_total_spend", "recency_days"]

X = data_pf[feature_cols]
y = data_pf["high_value_future"]

X = X.fillna(0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


[[220   5]
 [ 57  18]]
              precision    recall  f1-score   support

           0       0.79      0.98      0.88       225
           1       0.78      0.24      0.37        75

    accuracy                           0.79       300
   macro avg       0.79      0.61      0.62       300
weighted avg       0.79      0.79      0.75       300
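
Because the stated goal is to rank customers rather than to classify them at a fixed 0.5 cut-off, threshold-free metrics may describe this model better than accuracy alone. The sketch below uses synthetic data from `make_classification` as a stand-in for the customer table (all names ending in `_demo` are assumptions for illustration) and scores the predicted probabilities with ROC AUC and average precision:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer table: roughly 25% positives, 4 features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.75, 0.25], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# Both metrics score the ordering of customers, not the 0.5 cut-off
roc_auc = roc_auc_score(y_te, proba)
avg_precision = average_precision_score(y_te, proba)
print(f"ROC AUC: {roc_auc:.3f}, average precision: {avg_precision:.3f}")
```

On the real data the same two calls could be applied to `y_test` and `model.predict_proba(X_test)[:, 1]`.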



In [45]:
# Assuming feature_cols is still defined, e.g.
# feature_cols = ["past_num_orders", "past_avg_order_value", "past_total_spend", "recency_days"]

coeffs = pd.DataFrame({
    "feature": feature_cols,
    "coefficient": model.coef_[0]
})

print(coeffs)


                feature  coefficient
0       past_num_orders    -0.929147
1  past_avg_order_value    -0.003784
2      past_total_spend     0.000090
3          recency_days    -0.000705
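
The raw coefficients above are expressed per unit of each feature (one order, one currency unit, one day), so their magnitudes are not directly comparable. One common option is to standardise the features before fitting. The sketch below does this on synthetic data; the column names mirror the notebook's, but the values and the label rule are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "past_num_orders": rng.integers(1, 10, size=n),
    "past_total_spend": rng.gamma(2.0, 500.0, size=n),
    "recency_days": rng.integers(0, 365, size=n),
})
# Made-up label that depends only on spend plus noise
y_demo = (demo["past_total_spend"] + rng.normal(0, 300, size=n) > 1200).astype(int)

# Scaling puts every feature on a mean-0, variance-1 scale before the fit
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(demo, y_demo)

# Coefficients are now per standard deviation of each feature
std_coeffs = pd.DataFrame({
    "feature": demo.columns,
    "std_coefficient": pipe.named_steps["logisticregression"].coef_[0],
}).sort_values("std_coefficient", key=np.abs, ascending=False)
print(std_coeffs)
```

Standardised coefficients still describe association rather than causation, and correlated inputs such as order count and total spend can share or trade off weight between them.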


In [47]:
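# Score all customers in data_pf (X). This includes rows the model was trained on,
# so use these scores for ranking, not for measuring performance.
y_proba = model.predict_proba(X)[:, 1]   # probability of class 1 (high_value_future)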

data_pf["predicted_prob_high_value"] = y_proba

print(data_pf[["customer_id", "predicted_prob_high_value"]].head())


  customer_id  predicted_prob_high_value
0     CS00001                   0.080037
1     CS00002                   0.076179
2     CS00003                   0.137395
3     CS00004                   0.048781
4     CS00005                   0.503636


In [49]:
# Sort customers by predicted probability (highest first)
data_pf_sorted = data_pf.sort_values(
    by="predicted_prob_high_value",
    ascending=False
)

# Example: top 20% of customers by predicted probability
top_20_percent_cutoff = int(len(data_pf_sorted) * 0.2)
top_customers = data_pf_sorted.head(top_20_percent_cutoff)

print(top_customers[["customer_id", "predicted_prob_high_value"]].head())


    customer_id  predicted_prob_high_value
547     CS00548                   0.983070
521     CS00522                   0.938331
131     CS00132                   0.923495
474     CS00475                   0.892360
986     CS00987                   0.870959
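
To judge whether a top-20% list is useful for targeting, one option is to compare the share of true high-value customers inside it with the overall base rate (often called lift). The sketch below uses ten made-up customers; the column names follow the notebook, and in practice the calculation would ideally be run on held-out customers only:

```python
import pandas as pd

# Toy scored customers (hypothetical probabilities and labels)
scored = pd.DataFrame({
    "customer_id": [f"C{i}" for i in range(10)],
    "predicted_prob_high_value": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05],
    "high_value_future": [1, 0, 1, 0, 0, 1, 0, 0, 0, 0],
})

# Take the top 20% by predicted probability, as in the cell above
top_n = int(len(scored) * 0.2)
top = scored.sort_values("predicted_prob_high_value", ascending=False).head(top_n)

precision_at_top = top["high_value_future"].mean()   # share of true positives in the list
base_rate = scored["high_value_future"].mean()       # share of true positives overall
lift = precision_at_top / base_rate                  # > 1 means better than random targeting

print(top_n, precision_at_top, round(lift, 2))
```

A lift close to 1 would suggest the ranking adds little over random selection, whereas a clearly higher value supports using the list for campaigns.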
