<div style="
    background-color: #ffd6eb;
    border-left: 6px solid #ff69b4;
    padding: 15px;
    border-radius: 8px;
    color: #880e4f;
    font-family: 'Helvetica Neue', sans-serif;
    box-shadow: 0 2px 6px rgba(255, 105, 180, 0.3);
">
<div class="alert alert-block alert-info">
<h1 align="center"> <font color='gray'>Customer Segmentation Analysis: Online Retail</font></h1>
Assessment 2 - Machine Learning <br>
Naomi Chellsea Espiritu <br>
Student ID: 639876 <br>
<br>
The objective of this assessment is to perform customer segmentation on the 'Online Retail' dataset. By applying **K-Means Clustering** to an RFM (recency-frequency-monetary) model, we aim to identify distinct customer groups.
This allows the business to move from a one-size-fits-all marketing strategy to targeted engagement.
</div>


In [96]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import plotly.graph_objects as go
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

df = pd.read_csv('/kaggle/input/online-retail/online_retail.csv')




<div style="
    background-color: #ffd6eb;
    border-left: 6px solid #ff69b4;
    padding: 15px;
    border-radius: 8px;
    color: #880e4f;
    font-family: 'Helvetica Neue', sans-serif;
    box-shadow: 0 2px 6px rgba(255, 105, 180, 0.3);
">
<div class="alert alert-block alert-info">
<h1 align="center"> <font color='gray'>Data Cleaning and Preprocessing</font></h1>
Upon inspection, I identified the following data quality issues: <br>
1. Missing CustomerIDs: Customers cannot be segmented if they cannot be identified, so rows with a null 'CustomerID' were removed. <br>
2. Cancelled Orders: Invoice numbers starting with 'C' indicate cancellations. These add noise to the "Monetary" value, so they were excluded. <br>
3. Negative Quantities: These are likely returns or data-entry errors, so they were filtered out.
</div>

In [97]:
#remove rows with missing CustomerIDs, cancelled orders and non-positive quantities
df = df.dropna(subset=['CustomerID'])
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]  #'C' prefix marks a cancellation
df = df[df['Quantity'] > 0]
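
To see how much each cleaning rule contributes, a small audit can count the rows removed at every step. The sketch below runs on a tiny made-up frame; `toy` and the `clean_transactions` helper are hypothetical names for illustration, and the only assumption is that the columns are named as in the dataset.

```python
import pandas as pd

def clean_transactions(frame):
    """Apply the three cleaning rules in order and count the rows each one drops.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    removed = {}

    n = len(frame)
    out = frame.dropna(subset=['CustomerID'])
    removed['missing_customer_id'] = n - len(out)

    n = len(out)
    out = out[~out['InvoiceNo'].astype(str).str.startswith('C')]
    removed['cancelled'] = n - len(out)

    n = len(out)
    out = out[out['Quantity'] > 0]
    removed['non_positive_quantity'] = n - len(out)
    return out, removed

# Made-up rows: one valid, one cancelled, one without a customer, one negative quantity
toy = pd.DataFrame({
    'InvoiceNo': ['536365', 'C536379', '536380', '536381'],
    'CustomerID': [17850.0, 14527.0, None, 15311.0],
    'Quantity': [6, -1, 3, -2],
})
cleaned, removed = clean_transactions(toy)
print(removed)
```

Because the rules are applied in sequence, a row that matches more than one rule is counted only under the first rule that removes it.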

<div style="
    background-color: #ffd6eb;
    border-left: 6px solid #ff69b4;
    padding: 15px;
    border-radius: 8px;
    color: #880e4f;
    font-family: 'Helvetica Neue', sans-serif;
    box-shadow: 0 2px 6px rgba(255, 105, 180, 0.3);
">
<div class="alert alert-block alert-info">
<h1 align="center"> <font color='gray'>The RFM Model</font></h1>
Raw transactional data (one row per item) is not suitable for customer-level clustering, so I transformed the data into an RFM structure. <br>
* Recency: Days since the last purchase, counted from a snapshot date set one day after the latest invoice <br>
* Frequency: Total number of unique invoices <br>
* Monetary: Total spend <br>
This condenses thousands of transactions into a single row per customer.
</div>

In [98]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

snapshot_date = df['InvoiceDate'].max() + dt.timedelta(days=1)

rfm = df.groupby(['CustomerID']).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days, #Recency
    'InvoiceNo': 'nunique',                                  #Frequency
    'TotalPrice': 'sum'                                      #Monetary
}).rename(columns={'InvoiceDate': 'Recency', 'InvoiceNo': 'Frequency', 'TotalPrice': 'Monetary'})
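
The aggregation can be checked by hand on a few rows. The sketch below uses made-up transactions for two customers (`toy` and `toy_rfm` are illustrative names) and the same Recency, Frequency and Monetary definitions, written with pandas named aggregation instead of a rename step.

```python
import pandas as pd

# Made-up transactions for two customers (illustrative only)
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2],
    'InvoiceNo': ['A1', 'A1', 'A2', 'B1'],
    'InvoiceDate': pd.to_datetime(['2011-12-01', '2011-12-01', '2011-12-08', '2011-11-09']),
    'Quantity': [2, 1, 5, 10],
    'UnitPrice': [3.0, 4.0, 1.0, 2.5],
})
toy['TotalPrice'] = toy['Quantity'] * toy['UnitPrice']

# Snapshot is one day after the latest invoice, i.e. 2011-12-09
snapshot = toy['InvoiceDate'].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (snapshot - x.max()).days),
    Frequency=('InvoiceNo', 'nunique'),
    Monetary=('TotalPrice', 'sum'),
)
print(toy_rfm)
```

Customer 1 has two line items on invoice A1 and one on A2, so Frequency is 2 rather than 3. This is why the notebook counts unique invoices instead of rows.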


<div style="
    background-color: #ffd6eb;
    border-left: 6px solid #ff69b4;
    padding: 15px;
    border-radius: 8px;
    color: #880e4f;
    font-family: 'Helvetica Neue', sans-serif;
    box-shadow: 0 2px 6px rgba(255, 105, 180, 0.3);
">
<div class="alert alert-block alert-info">
<h1 align="center"> <font color='gray'>Handling Skewness and Scaling</font></h1>
K-Means groups points by Euclidean distance, so it is sensitive to both skewed distributions and feature scale. <br>
1. Skewness: Retail data tends to follow a "Power Law" (Pareto Principle) where a few customers spend very large amounts, while most spend little. This pulls the cluster centers toward the extreme values, so I applied a log transformation (log1p) to reduce the skew. <br>
2. Scaling: Monetary values (e.g. 5000) are much larger than Frequency values (e.g. 5). Without scaling, the distance calculation would be dominated by Monetary, so I applied StandardScaler to give all features equal weight.
</div>

In [99]:
#Retail data is heavily skewed, so apply log1p, i.e. log(1 + x), to reduce the skew (zero values stay valid)
rfm_log = np.log1p(rfm)

In [100]:
#Scaling
scaler = StandardScaler()
rfm_normalized = scaler.fit_transform(rfm_log)
rfm_normalized_df = pd.DataFrame(rfm_normalized, index=rfm.index, columns=rfm.columns)

print("Data Preprocessed and Scaled.")

Data Preprocessed and Scaled.
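
The effect of the log transformation can be checked by comparing skewness before and after. The sketch below uses synthetic log-normal values as a stand-in for Monetary; the distribution parameters are arbitrary assumptions, not estimates from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "spend" values standing in for the Monetary column
spend = pd.Series(rng.lognormal(mean=6.0, sigma=1.2, size=5000))

skew_raw = spend.skew()            # strongly positive for right-skewed data
skew_log = np.log1p(spend).skew()  # close to 0 once the long tail is compressed

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

In the notebook itself, the equivalent check would be `rfm.skew()` against `rfm_log.skew()`.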


<div style="
    background-color: #ffd6eb;
    border-left: 6px solid #ff69b4;
    padding: 15px;
    border-radius: 8px;
    color: #880e4f;
    font-family: 'Helvetica Neue', sans-serif;
    box-shadow: 0 2px 6px rgba(255, 105, 180, 0.3);
">
<div class="alert alert-block alert-info">
<h1 align="center"> <font color='gray'>Determining 'K' (Model Selection)</font></h1>
Choosing the number of clusters is a critical decision. I used two validation techniques: <br>
1. The Elbow Method: This looks for the point where the reduction in inertia (the within-cluster sum of squared distances) slows down. <br>
2. Silhouette Score: This measures how similar each point is to its own cluster compared to the nearest other cluster, on a scale from -1 to 1.
</div>

In [101]:
#Calculate inertia (elbow method) and silhouette score
inertia = []
silhouette_scores = []
K_range = range(2, 10)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(rfm_normalized_df)
    inertia.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(rfm_normalized_df, kmeans.labels_))

fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.update_layout(template="plotly_dark", title_text='Optimal k Analysis: Elbow & Silhouette', hovermode="x unified")

#plotting elbow
fig.add_trace(go.Scatter(x=list(K_range), y=inertia, name="Elbow (Inertia)", marker=dict(color='#00FFFF', size=12, line=dict(width=2, color='white')), line=dict(color='#00FFFF', width=4)), secondary_y=False)
fig.update_yaxes(title_text="Inertia", color='#00FFFF', secondary_y=False, showgrid=False)

#plotting Silhouette
fig.add_trace(go.Scatter(x=list(K_range), y=silhouette_scores, name="Silhouette Score", marker=dict(color='#FF00FF', size=12, symbol='diamond', line=dict(width=2, color='white')), line=dict(color='#FF00FF', width=4, dash='dot')), secondary_y=True)
fig.update_yaxes(title_text="Silhouette Score", color='#FF00FF', secondary_y=True, showgrid=False)

fig.update_xaxes(title_text="Number of Clusters (k)", showgrid=True, gridwidth=1, gridcolor='gray')
fig.show()
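
For reference, the two quantities plotted here can be written out explicitly. The notation ($C_j$, $\mu_j$, $a(i)$, $b(i)$) is introduced for illustration and does not appear in the notebook code; the formulas are the standard definitions behind scikit-learn's `inertia_` and `silhouette_score`.

```latex
% Inertia for k clusters: within-cluster sum of squared distances to each centroid
W(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

% Silhouette of a single point i:
%   a(i) = mean distance from i to the other points in its own cluster
%   b(i) = mean distance from i to the points in the nearest other cluster
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1

% Reported score: the mean over all n points
\bar{s} = \frac{1}{n} \sum_{i=1}^{n} s(i)
```

Inertia never increases as k grows, which is why the elbow method looks for the bend in the curve rather than its minimum. The silhouette score has no such built-in trend, so its peak can be read directly.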

<div style="
    background-color: #ffd6eb;
    border-left: 6px solid #ff69b4;
    padding: 15px;
    border-radius: 8px;
    color: #880e4f;
    font-family: 'Helvetica Neue', sans-serif;
    box-shadow: 0 2px 6px rgba(255, 105, 180, 0.3);
">
<div class="alert alert-block alert-info">
<h1 align="center"> <font color='gray'>Model Development and Dimensionality Reduction (PCA)</font></h1>
Based on my interpretation of the elbow and silhouette results, I selected k=3. Although k=2 had a higher silhouette score, it only splits customers into 'active' vs 'inactive'. A 3-cluster solution provides a 'middle' segment, which is more actionable for business growth. <br>

To visualize these 3-dimensional clusters (Recency, Frequency, Monetary) on a 2D screen, I applied PCA (principal component analysis) to reduce the dimensions while retaining as much variance as possible.
</div>

In [102]:
#k=3 for business relevance (Low, Mid, High value customers) 
k = 3
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(rfm_normalized_df) #even if k=2 has a higher score, k=3 provides more usable insights.

#assign clusters back to original data
rfm['Cluster'] = kmeans.labels_
rfm['ClusterNames'] = rfm['Cluster'].astype(str)

#Reducing 3 Dimensions to 2
pca = PCA(n_components=2)
pca_data = pca.fit_transform(rfm_normalized_df)
rfm['PCA1'] = pca_data[:, 0]
rfm['PCA2'] = pca_data[:, 1]

fig = px.scatter(
    rfm, 
    x='PCA1', 
    y='PCA2', 
    color='ClusterNames',
    title='<b>Customer Segments</b>: PCA Visualization',
    color_discrete_sequence=['#00FFFF', '#FF00FF', '#FFFF00'],
    template="plotly_dark",
)

fig.update_traces(
    marker=dict(size=12, line=dict(width=1, color='white'), opacity=0.8),
    selector=dict(mode='markers')
)

fig.update_layout(
    title_font=dict(size=24, color='white', family="Arial Black"),
    paper_bgcolor='#1e1e1e',
    plot_bgcolor='rgba(0,0,0,0)',
    legend_title_text='Segment',
    legend=dict(
        orientation="h", 
        yanchor="bottom", 
        y=1.02, 
        xanchor="right", 
        x=1,
        font=dict(size=14, color='white')
    ),
    xaxis=dict(showgrid=False, title_font=dict(size=16)),
    yaxis=dict(showgrid=True, gridcolor='#333', title_font=dict(size=16))
)

fig.show()
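
How faithful a 2D PCA view is depends on how much variance the two components retain, which `explained_variance_ratio_` reports. The sketch below shows the check on synthetic data standing in for the scaled RFM matrix; the correlation structure is an arbitrary assumption chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the scaled RFM matrix:
# two strongly correlated features plus one independent feature
base = rng.normal(size=(500, 1))
X = np.hstack([
    base + 0.3 * rng.normal(size=(500, 1)),
    base + 0.3 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])
X = StandardScaler().fit_transform(X)

pca_check = PCA(n_components=2).fit(X)
ratios = pca_check.explained_variance_ratio_  # share of variance per component
print(ratios, ratios.sum())
```

In the notebook, the same attribute is available as `pca.explained_variance_ratio_` after the `fit_transform` call. A sum close to 1 would suggest the 2D scatter loses little of the cluster structure.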

In [103]:
#dataframe specifically for the snake plot (NORMALIZED data)
#work on a copy of the normalized values so the original dataframe is not modified
snek_plot = rfm_normalized_df.copy()
snek_plot['ClusterNames'] = rfm['Cluster'].astype(str)

snek_melt = pd.melt(snek_plot.reset_index(), 
                     id_vars=['CustomerID', 'ClusterNames'],
                     value_vars=['Recency', 'Frequency', 'Monetary'],
                     var_name='Metric', value_name='Value')

#average (Mean) Z-score for each cluster
fig = px.line(snek_melt.groupby(['ClusterNames', 'Metric'])['Value'].mean().reset_index(), 
              x='Metric', y='Value', color='ClusterNames',
              title='<b>Snake Plot of Standardized Values</b>',
              template='plotly_dark',
              markers=True,
              color_discrete_sequence=['#00FFFF', '#FF00FF', '#FFFF00'])

fig.update_layout(
    yaxis_title="Z-Score",
    title_font=dict(size=24, color='white', family="Arial Black"),
    paper_bgcolor='#1e1e1e',
    plot_bgcolor='rgba(0,0,0,0)',
    font=dict(color='white')
)
fig.show()

#print the actual means (raw numbers) for report table
print("Cluster Averages (the raw values)")
print(rfm.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean())

Cluster Averages (the raw values)
            Recency  Frequency     Monetary
Cluster                                    
0         15.893733  13.698910  8101.497071
1        164.892316   1.360690   365.511684
2         43.931442   3.473995  1339.057449


In [104]:
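#names follow the cluster averages above: cluster 0 is the most recent, most frequent and
#highest-spending group, while cluster 1 has the highest Recency and the lowest spend
clusterNames = {
    0: 'Champions (High Spend)',
    1: 'At Risk (High Recency)',
    2: 'Potential (Average)'
}

rfm['Names'] = rfm['Cluster'].map(clusterNames)
print(rfm['Names'].value_counts())

Names
At Risk (High Recency)    1913
Potential (Average)       1692
Champions (High Spend)     734
Name: count, dtype: int64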
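
K-Means label integers are arbitrary and can change with the random seed or the data, so a hard-coded mapping is fragile. One option is to derive the names from the cluster averages. The sketch below is a minimal version that reuses the printed averages and ranks clusters by mean Monetary value alone, which is a simplifying assumption; `cluster_means` and `derived_names` are illustrative names.

```python
import pandas as pd

# Cluster means copied from the printed "Cluster Averages" table
cluster_means = pd.DataFrame(
    {
        'Recency': [15.893733, 164.892316, 43.931442],
        'Frequency': [13.698910, 1.360690, 3.473995],
        'Monetary': [8101.497071, 365.511684, 1339.057449],
    },
    index=pd.Index([0, 1, 2], name='Cluster'),
)

# Rank clusters by average spend, highest first, and attach labels in that order
ordered = cluster_means['Monetary'].sort_values(ascending=False).index
labels = ['Champions (High Spend)', 'Potential (Average)', 'At Risk (High Recency)']
derived_names = dict(zip(ordered, labels))
print(derived_names)
```

In the notebook, `cluster_means` could instead be computed with `rfm.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean()`, and the result passed to `rfm['Cluster'].map(...)`.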


<div style="
    background-color: #ffd6eb;
    border-left: 6px solid #ff69b4;
    padding: 15px;
    border-radius: 8px;
    color: #880e4f;
    font-family: 'Helvetica Neue', sans-serif;
    box-shadow: 0 2px 6px rgba(255, 105, 180, 0.3);
">
<div class="alert alert-block alert-info">
<h1 align="center"> <font color='gray'>Critical Reflection</font></h1>
While my implementation successfully segmented the customer base, several limitations must be recognized. <br>

1. Limitations of K-Means <br>
   One significant challenge I encountered was the geometric assumption of the K-Means algorithm. K-Means assumes that clusters are spherical and of roughly equal size. However, when I visualized the data using PCA, it became clear that real customer behaviour is often irregular. The PCA visualization showed that the "Champions" cluster was sparser than the dense "At Risk" cluster. <br>

   A density-based algorithm like DBSCAN might have handled these irregularities better by identifying outliers as "noise" rather than forcing them into a cluster. <br>

2. The Static Nature of RFM <br>

   Currently, this model represents a static snapshot in time. A customer classified as a "Champion" today could become "At Risk" next month without the system detecting the transition until the next run. The model also lacks context regarding seasonality. For example, a customer who only buys during Christmas might be unfairly penalized for high Recency in July, leading to incorrect segmentation. <br>

To improve: <br>

If I were to expand this model further, I would focus on the following areas: <br>
Integration of Unstructured Data: Currently, I only look at how much customers bought, not what they bought. Analysing the product descriptions with NLP techniques could reveal distinct segments such as Parents vs Gift Buyers. <br>

Longitudinal Analysis: To address the static snapshot issue, I would implement a moving-window approach to track how customers migrate between clusters over time. This would provide an "Early Warning System" for churn (defection).
</div>
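
The DBSCAN idea from the reflection can be sketched as follows. This is a minimal illustration on synthetic points, not a result from the retail data: the dense blob and the three far-away points are made up, and `eps` and `min_samples` are untuned guesses.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic stand-in for scaled RFM features: one dense blob plus three far-away points
dense = rng.normal(loc=0.0, scale=0.3, size=(200, 3))
outliers = np.array([[8.0, 8.0, 8.0], [-9.0, 7.0, -8.0], [10.0, -9.0, 9.0]])
X = np.vstack([dense, outliers])

# eps and min_samples are illustrative guesses, not tuned values
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# DBSCAN marks points that belong to no dense region with the label -1
n_noise = int((db.labels_ == -1).sum())
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

Unlike K-Means, DBSCAN does not take the number of clusters as an input, so the segment count would follow from the density settings rather than from a business choice such as k=3.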