# Product Segmentation using Clustering

In this notebook, we perform unsupervised product segmentation on the **Online Retail** dataset using **KMeans Clustering**.

### Objective
Segment products based on:
- Total quantity sold
- Average price per unit
- Number of transactions each product appears in

This helps retailers:
- Tailor promotions and bundling
- Prioritize inventory based on sales behavior

## Step 1: Import Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
import scipy.cluster.hierarchy as sch

## Step 2: Load and Clean the Dataset

In [None]:
df = pd.read_csv('../Data/Online_Retail.csv')
# Drop missing values
df = df.dropna(subset=['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
                       'UnitPrice', 'CustomerID', 'Country'])
# Convert InvoiceDate to datetime format
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], dayfirst=False)

In [None]:
# Preview
df.head()
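
The cleaning step above can be checked on a small hand-made frame. This is only a sketch: the rows in `toy_raw` are hypothetical values, not taken from the dataset, and only three of the columns are included. It shows that `dropna(subset=...)` removes any row with a missing value in the listed columns, and that `dayfirst=False` reads `12/1/2010` as December 1 rather than January 12.

```python
import pandas as pd

# Hypothetical rows for illustration only (not real dataset records)
toy_raw = pd.DataFrame({
    'StockCode': ['A', 'B', 'C'],
    'CustomerID': [17850.0, None, 13047.0],
    'InvoiceDate': ['12/1/2010 8:26', '12/1/2010 8:28', '1/12/2011 9:00'],
})

# Drop rows with a missing value in any of the listed columns
toy_clean = toy_raw.dropna(subset=['StockCode', 'CustomerID', 'InvoiceDate']).copy()

# Parse the date strings month-first
toy_clean['InvoiceDate'] = pd.to_datetime(toy_clean['InvoiceDate'], dayfirst=False)

print(toy_clean)
```

The row for product `B` is dropped because its `CustomerID` is missing.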

## Step 3: Product-Level Feature Engineering

We group the data by product (`StockCode`) to compute:

- Total quantity sold
- Average price per unit
- Number of transactions the product appears in

We'll also create a `TotalRevenue` column for additional insight.


In [None]:
# Creating Product-level feature set
product_df = df.groupby('StockCode').agg({
    'Description': 'first',
    'Quantity': 'sum',
    'UnitPrice': 'mean',
    'InvoiceNo': 'nunique'
}).rename(columns={
    'InvoiceNo': 'TransactionCount',
    'Quantity': 'TotalQuantitySold',
    'UnitPrice': 'AvgUnitPrice'
})
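# Add a total revenue column for EDA.
# Note: total quantity times the mean unit price is an approximation;
# it matches the true revenue only when the unit price is constant per product.
product_df['TotalRevenue'] = product_df['TotalQuantitySold'] * product_df['AvgUnitPrice']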
# Reset index so StockCode becomes a regular column instead of an index
product_df = product_df.reset_index()

In [None]:
# Preview
product_df.head()
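
`TotalRevenue` is computed above as total quantity times the mean unit price. The sketch below, on hypothetical toy rows, compares that shortcut with a line-level calculation (quantity times price per row, summed per product). The column name `LineRevenue` is an assumption introduced here for illustration.

```python
import pandas as pd

# Hypothetical transactions for illustration only
toy = pd.DataFrame({
    'StockCode': ['A', 'A', 'B'],
    'Quantity': [10, 1, 5],
    'UnitPrice': [1.0, 3.0, 2.0],
})

# Line-level revenue: price times quantity per row, summed per product
toy['LineRevenue'] = toy['Quantity'] * toy['UnitPrice']
exact = toy.groupby('StockCode')['LineRevenue'].sum()

# Shortcut: total quantity times the mean unit price per product
agg = toy.groupby('StockCode').agg(
    TotalQuantitySold=('Quantity', 'sum'),
    AvgUnitPrice=('UnitPrice', 'mean'),
)
approx = agg['TotalQuantitySold'] * agg['AvgUnitPrice']

print(pd.DataFrame({'exact': exact, 'approx': approx}))
```

The two figures agree for product `B`, whose price never changes, and diverge for product `A`, whose price varies across rows. If revenue feeds into later decisions, the line-level version may be the safer choice.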

## Step 4: Feature Selection and Scaling

We’ll select 3 numeric features that capture product-level sales behavior:

- TotalQuantitySold
- AvgUnitPrice
- TransactionCount

Then we’ll scale the data using **StandardScaler** before clustering.


In [None]:
# Feature selection
# Select columns by name (.iloc accepts only integer positions)
X = product_df[['TotalQuantitySold', 'AvgUnitPrice', 'TransactionCount']]

# Feature scaling
sc = StandardScaler()
X = sc.fit_transform(X)
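
As a quick sanity check on what scaling does, the sketch below applies `StandardScaler` to a small hypothetical feature table and confirms that every column ends up with mean 0 and standard deviation 1. This matters for KMeans because it relies on Euclidean distance, so an unscaled feature with a large range (such as quantity) would otherwise dominate.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical product features on very different scales
toy_features = pd.DataFrame({
    'TotalQuantitySold': [10, 200, 3000, 45],
    'AvgUnitPrice': [1.25, 0.85, 12.5, 4.0],
    'TransactionCount': [3, 40, 150, 9],
})

toy_scaled = StandardScaler().fit_transform(toy_features)

# Each column should now have mean ~0 and (population) standard deviation ~1
col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```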

## Step 5: KMeans Clustering and Elbow Method

We’ll use the elbow method to determine the optimal number of clusters based on WCSS (Within-Cluster Sum of Squares), and then apply KMeans clustering.

In [None]:
# Elbow method to find optimal k
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=0)
    kmeans.fit(X)  # X already holds the scaled features from Step 4
    wcss.append(kmeans.inertia_)

# Plotting the elbow graph
plt.figure(figsize=(8, 4))
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method for Optimal k')
plt.show()
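
To make the WCSS quantity concrete, the sketch below recomputes it by hand on synthetic data and compares it with `inertia_`. The data, the variable names, and the explicit `n_init=10` are assumptions for illustration; they are not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 2-D points in three loose groups (hypothetical data)
toy_X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(30, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(30, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(30, 2)),
])

toy_km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(toy_X)

# WCSS: sum of squared distances from each point to its assigned centroid
assigned_centroids = toy_km.cluster_centers_[toy_km.labels_]
manual_wcss = ((toy_X - assigned_centroids) ** 2).sum()

print(manual_wcss, toy_km.inertia_)
```

The two numbers should agree up to floating-point error. WCSS always decreases as `k` grows, which is why the elbow method looks for the point where the decrease flattens rather than for a minimum.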

## Step 6: Applying KMeans Clustering

We’ll use `k=3` clusters (based on the elbow method) and assign a cluster label to each product.

In [None]:
# Fit KMeans with optimal clusters k=3
km = KMeans(n_clusters=3, init='k-means++', random_state=0)
y_kmeans = km.fit_predict(X)

# Add the cluster label to the product DataFrame
product_df['Cluster(KMeans)'] = y_kmeans

In [None]:
# Preview with clusters
product_df.head()

## Step 7: KMeans Cluster Summary

Let’s analyze how each cluster differs based on:

- Average total quantity sold per product
- Average unit price
- Average number of transactions each product appears in

In [None]:
# Grouping by the kmeans clusters
km_cluster_summary = product_df.groupby('Cluster(KMeans)')[['TotalQuantitySold', 'AvgUnitPrice', 'TransactionCount']].mean()
print("KMeans Cluster Summary:")
print(km_cluster_summary)
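
A related way to read the clusters is to map the KMeans centroids back to the original units with the scaler's `inverse_transform`. The sketch below uses hypothetical toy products and its own scaler and model (`toy_scaler`, `toy_km`), so it does not depend on the variables above. Because standard scaling is an affine transform, the back-transformed centroids should match the per-cluster means of the raw features once KMeans has converged.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

cols = ['TotalQuantitySold', 'AvgUnitPrice', 'TransactionCount']

# Hypothetical products: three high-volume/low-price, three low-volume/high-price
toy_products = pd.DataFrame([
    [5000, 0.50, 300],
    [4500, 0.60, 280],
    [5200, 0.55, 310],
    [40, 12.0, 5],
    [60, 15.0, 8],
    [30, 10.0, 4],
], columns=cols)

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy_products)
toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_scaled)

# Centroids mapped back from scaled space to the original units
centroids = pd.DataFrame(
    toy_scaler.inverse_transform(toy_km.cluster_centers_), columns=cols
)

# Per-cluster means of the raw features, for comparison
means = toy_products.groupby(toy_km.labels_)[cols].mean()

print(centroids)
print(means)
```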