# E-commerce Customer Analytics

This notebook demonstrates a complete pipeline—from raw data cleaning to RFM segmentation and key visualizations—using the UCI Online Retail dataset.  
**Goals:**
1. Clean and preprocess raw transactions  
2. Compute RFM (Recency, Frequency, Monetary) metrics and quintile scores  
3. Generate actionable static visualizations  
4. Export results for downstream dashboards and reporting  


## Table of Contents

- [Setup & Imports](#setup)
- [1. Data Loading & Cleaning](#cleaning)
- [2. RFM Metric Calculation](#rfm-metrics)
- [3. RFM Scoring & Segmentation](#rfm-scoring)
- [4. Key Visualizations](#visualizations)
- [5. Export Results & Next Steps](#export)


## Setup & Imports


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Global styles
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (8, 4)


## 1. Data Loading & Cleaning

Load the raw Excel file and perform basic cleaning:
- Drop duplicates  
- Remove cancelled orders (`InvoiceNo` starts with "C")  
- Remove transactions with missing `CustomerID`  


In [13]:
df = pd.read_excel("../data/Online Retail.xlsx")

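# Initial glimpse (info() prints its own summary)
df.info()

# Clean data
df_clean = (
    df
    .drop_duplicates()
    # Build the mask from the already-deduplicated frame, not the original df
    .loc[lambda d: ~d['InvoiceNo'].astype(str).str.startswith('C')]
    .dropna(subset=['CustomerID'])
    # Line total (UnitPrice × Quantity), used later as the Monetary input
    .assign(TotalPrice=lambda d: d['Quantity'] * d['UnitPrice'])
)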

# Save cleaned raw (optional)
df_clean.to_csv("../data/cleaned_retail.csv", index=False)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB
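
As a quick sanity check of the three cleaning rules, the sketch below applies them to a four-row toy frame. The rows are made-up values, not taken from the dataset, and `toy` / `toy_clean` are illustrative names only.

```python
import pandas as pd

# Toy transactions (made-up values) that exercise each cleaning rule
toy = pd.DataFrame({
    "InvoiceNo":  ["536365", "536365", "C536379", "536380"],
    "Quantity":   [6, 6, -1, 2],
    "UnitPrice":  [2.55, 2.55, 27.50, 1.25],
    "CustomerID": [17850.0, 17850.0, 14527.0, None],
})

toy_clean = (
    toy
    .drop_duplicates()                                               # rows 0 and 1 are identical
    .loc[lambda d: ~d["InvoiceNo"].astype(str).str.startswith("C")]  # drop the cancellation
    .dropna(subset=["CustomerID"])                                   # drop the order with no customer
)

print(len(toy), "->", len(toy_clean))
```

Only the first row survives: one duplicate, one cancellation, and one row without a `CustomerID` are removed.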


# 02_rfm_segmentation.ipynb

## 🔍 Customer Segmentation with RFM

**Objective:**  
1. Load cleaned transaction data  
2. Compute Recency, Frequency, Monetary (RFM) metrics per customer  
3. Assign RFM scores and segments  
4. Export `rfm_segments.csv`


In [3]:
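# Path to the processed data. The cleaning cell above writes to
# ../data/cleaned_retail.csv, so point this at the location you actually use.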
data_path = "../data/processed/cleaned_retail.csv"
df = pd.read_csv(data_path, parse_dates=['InvoiceDate'])

print("Loaded cleaned_retail:", df.shape)
df.head()


Loaded cleaned_retail: (397924, 9)


|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | TotalPrice |
|---|-----------|-----------|-------------|----------|-------------|-----------|------------|---------|------------|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.3 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.0 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |


## 2. RFM Metric Calculation

For each customer:
- **Recency** = days between `snapshot_date` (one day after the latest invoice) and the customer's last purchase  
- **Frequency** = number of unique invoices  
- **MonetaryValue** = total spend (UnitPrice × Quantity)  


In [None]:
# Reference date = one day after last InvoiceDate
snapshot_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)

# Aggregate RFM metrics
rfm = (
    df
    .groupby('CustomerID')
    .agg(
        Recency       = ('InvoiceDate', lambda x: (snapshot_date - x.max()).days),
        Frequency     = ('InvoiceNo', 'nunique'),
        MonetaryValue = ('TotalPrice', 'sum')
    )
)

rfm.head()



| CustomerID | Recency | Frequency | MonetaryValue |
|------------|---------|-----------|---------------|
| 12346.0 | 326 | 1 | 77183.6 |
| 12347.0 | 2 | 7 | 4310.0 |
| 12348.0 | 75 | 4 | 1797.24 |
| 12349.0 | 19 | 1 | 1757.55 |
| 12350.0 | 310 | 1 | 334.4 |
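
To make the three definitions concrete, here is a minimal sketch of the same aggregation on a toy frame. The customers, invoices, and amounts are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Made-up transactions: customer 1 has two invoices (A1 has two lines), customer 2 has one
toy = pd.DataFrame({
    "CustomerID":  [1, 1, 1, 2],
    "InvoiceNo":   ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(["2011-01-01", "2011-01-01", "2011-01-10", "2011-01-05"]),
    "TotalPrice":  [10.0, 5.0, 20.0, 7.5],
})

# Snapshot is one day after the latest invoice, i.e. 2011-01-11
snapshot = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda x: (snapshot - x.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    MonetaryValue=("TotalPrice", "sum"),
)
print(toy_rfm)
```

Customer 1 ends up with Recency 1, Frequency 2 (two distinct invoices, not three lines), and MonetaryValue 35.0; customer 2 gets Recency 6, Frequency 1, and MonetaryValue 7.5.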


## 3. RFM Scoring & Segmentation

1. Compute percent-rank for each metric (0–1).  
2. Map percent-rank to quintile scores 1–5.  
3. Invert Recency score so that more recent = higher score.  
4. Concatenate into a 3-digit `RFM_Segment` and sum into `RFM_Score` (3–15).
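
The steps above can be traced on a small example. This sketch repeats the notebook's scoring function so that it runs on its own, then applies it to ten made-up Recency values and to two hand-written score rows; `days` and `scores` are illustrative names.

```python
import pandas as pd

def to_quintile_score(series, invert=False):
    # Percent-rank in (0, 1], cut into five buckets of width 0.2
    pct = series.rank(method="min", pct=True)
    score = ((pct * 5).astype(int) + 1).clip(upper=5)
    return 6 - score if invert else score

# Ten made-up Recency values (days since last purchase), already sorted
days = pd.Series([3, 10, 25, 40, 60, 90, 120, 200, 300, 365])
r_score = to_quintile_score(days, invert=True)
print(r_score.tolist())   # [5, 4, 4, 3, 3, 2, 2, 1, 1, 1]

# Combining three scores into the segment code and the total score
scores = pd.DataFrame({"R_Score": [5, 1], "F_Score": [4, 2], "M_Score": [5, 1]})
scores["RFM_Segment"] = (
    scores["R_Score"].astype(str) + scores["F_Score"].astype(str) + scores["M_Score"].astype(str)
)
scores["RFM_Score"] = scores[["R_Score", "F_Score", "M_Score"]].sum(axis=1)
print(scores[["RFM_Segment", "RFM_Score"]])
```

Note that a percent-rank landing exactly on a bucket boundary (0.2, 0.4, ...) is assigned to the higher bucket, so on very small inputs the five groups are not equal in size. On a customer base of thousands the effect is much smaller.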


In [17]:
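def to_quintile_score(series, invert=False):
    """Map a metric to a 1-5 score via its percent-rank; invert=True flips the scale."""
    pct   = series.rank(method='min', pct=True)   # percent-rank in (0, 1]
    score = (pct * 5).astype(int) + 1             # buckets of width 0.2, 1-based
    score = score.clip(upper=5)                   # pct == 1.0 would otherwise give 6
    return 6 - score if invert else score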

# Percent-rank columns
rfm['Recency_Pct']   = rfm['Recency'].rank(method='min', pct=True)
rfm['Frequency_Pct'] = rfm['Frequency'].rank(method='min', pct=True)
rfm['Monetary_Pct']  = rfm['MonetaryValue'].rank(method='min', pct=True)

# Quintile scores
rfm['R_Score'] = to_quintile_score(rfm['Recency'],        invert=True)
rfm['F_Score'] = to_quintile_score(rfm['Frequency'],      invert=False)
rfm['M_Score'] = to_quintile_score(rfm['MonetaryValue'],  invert=False)

# Segment code & total score
rfm['RFM_Segment'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)
rfm['RFM_Score']   = rfm[['R_Score','F_Score','M_Score']].sum(axis=1)

rfm.head()

def map_rfm_to_category(row):
    R, F = row['R_Score'], row['F_Score']
    if R >= 4 and F >= 4:
        return "Champions"
    if R >= 3 and F >= 3:
        return "Loyal Customers"
    if R >= 4 and 2 <= F <= 3:
        return "Potential Loyalist"
    if R == 5 and F <= 2:
        return "New Customers"
    if 2 <= R <= 3 and F >= 4:
        return "At Risk"
    if R <= 2 and 2 <= F <= 3:
        return "Need Attention"
    if R <= 2 and F <= 2:
        return "Hibernating"
    return "Others"

rfm['Category'] = rfm.apply(map_rfm_to_category, axis=1)

# Most frequent country per customer (first mode on ties)
country_mode = (
    df
    .dropna(subset=['CustomerID'])
    .groupby('CustomerID')['Country']
    .agg(lambda x: x.mode().iat[0] if not x.mode().empty else x.iloc[0])
)

# Add to the rfm DataFrame
rfm = rfm.join(country_mode.rename('Country'))

# Finally, save
rfm.to_csv('../data/rfm_segments.csv', index=True)
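
Because the category rules in `map_rfm_to_category` overlap, the first matching rule wins. The sketch below repeats the function so it runs standalone and applies it to a few hand-picked (R, F) pairs; the `samples` frame is illustrative, not real customer data.

```python
import pandas as pd

def map_rfm_to_category(row):
    R, F = row["R_Score"], row["F_Score"]
    if R >= 4 and F >= 4:
        return "Champions"
    if R >= 3 and F >= 3:
        return "Loyal Customers"
    if R >= 4 and 2 <= F <= 3:
        return "Potential Loyalist"
    if R == 5 and F <= 2:
        return "New Customers"
    if 2 <= R <= 3 and F >= 4:
        return "At Risk"
    if R <= 2 and 2 <= F <= 3:
        return "Need Attention"
    if R <= 2 and F <= 2:
        return "Hibernating"
    return "Others"

# Hand-picked score pairs, one per outcome
samples = pd.DataFrame({
    "R_Score": [5, 4, 4, 5, 2, 2, 1, 3],
    "F_Score": [5, 3, 2, 1, 5, 3, 1, 1],
})
samples["Category"] = samples.apply(map_rfm_to_category, axis=1)
print(samples)
```

For instance, (R=4, F=3) satisfies both the "Loyal Customers" and "Potential Loyalist" conditions but is labelled "Loyal Customers" because that rule is checked first, and (R=3, F=1) matches no rule and falls through to "Others".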



## 4. Key Visualizations

1. **Top 10 RFM Segments + “Other”**  
2. **Distribution by RFM_Score**  
3. **Average Spend Heatmap by Recency × Frequency Score**  
4. **Top 5 Segments by Revenue Share**  
5. **Spend Boxplot in Top 5 Segments**  
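
The "Top N + Other" grouping used in the first chart can be checked without plotting. This is a minimal sketch on made-up segment codes with N = 2 instead of 10; `segments` and `collapsed` are illustrative names.

```python
import pandas as pd

# Made-up segment codes for eight customers
segments = pd.Series(["555", "555", "555", "111", "111", "344", "211", "455"])

counts = segments.value_counts()
top = counts.nlargest(2)                            # keep the two largest segments
collapsed = top.copy()
collapsed["Other"] = counts.drop(top.index).sum()   # everything else in one bucket

print(collapsed)
```

The collapsed counts still sum to the number of customers, which is a cheap invariant to assert before drawing the bar chart.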


In [None]:
# 1. Top 10 + Other
counts   = rfm['RFM_Segment'].value_counts()
top10    = counts.nlargest(10)
others   = counts.drop(top10.index).sum()
plot1    = top10.copy()
plot1['Other'] = others

plot1.sort_values().plot(kind='barh')
plt.title('Top 10 RFM Segments + Other')
plt.xlabel('Number of Customers')
plt.tight_layout()
plt.show()


In [None]:
# 2. RFM_Score distribution
score_counts = rfm['RFM_Score'].value_counts().sort_index()
score_counts.plot(kind='bar')
plt.title('Distribution by RFM Score (3–15)')
plt.xlabel('RFM Score')
plt.ylabel('Number of Customers')
plt.tight_layout()
plt.show()


In [None]:
# 3. Heatmap of avg spend by R×F
pivot = rfm.pivot_table(index='R_Score', columns='F_Score', values='MonetaryValue', aggfunc='mean')
sns.heatmap(pivot, annot=True, fmt=".0f", cmap="YlGnBu", cbar_kws={'label':'Avg Spend'})
plt.title('Average Spend by Recency & Frequency')
plt.xlabel('Frequency Score')
plt.ylabel('Recency Score')
plt.tight_layout()
plt.show()


In [None]:
# 4. Top 5 segments by revenue
rev   = rfm.groupby('RFM_Segment')['MonetaryValue'].sum()
share = (rev / rev.sum()).sort_values(ascending=False).head(5)
share.plot(kind='bar')
plt.title('Top 5 RFM Segments by Revenue Share')
plt.xlabel('RFM Segment')
plt.ylabel('Revenue Share')
plt.tight_layout()
plt.show()


In [None]:
# 5. Boxplot for top 5 segments
top5   = share.index.tolist()
subset = rfm[rfm['RFM_Segment'].isin(top5)]
sns.boxplot(x='RFM_Segment', y='MonetaryValue', data=subset)
plt.title('Spend Distribution in Top 5 Segments')
plt.xlabel('RFM Segment')
plt.ylabel('Monetary Value')
plt.tight_layout()
plt.show()


## 5. Export Results & Next Steps

- **Exported** `rfm` to `data/rfm_segments.csv`.  
- **Next:** build an interactive Streamlit dashboard (`dashboards/app.py`) that supports filtering by date and segment and drilling into customer profiles.
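
One detail worth checking when a downstream dashboard reads the export: `RFM_Segment` is a three-character code, but `read_csv` infers it as an integer unless told otherwise. The sketch below uses an in-memory stand-in for `rfm_segments.csv` with made-up rows and a reduced set of columns.

```python
import io
import pandas as pd

# Stand-in for rfm_segments.csv (made-up rows, column names as in the export)
csv_text = (
    "CustomerID,R_Score,F_Score,M_Score,RFM_Segment,RFM_Score,Category\n"
    "12347.0,5,5,5,555,15,Champions\n"
    "12350.0,1,1,2,112,4,Hibernating\n"
)

# Default parsing turns the segment code into a number
naive = pd.read_csv(io.StringIO(csv_text), index_col="CustomerID")

# Declaring the column as str keeps it a label
typed = pd.read_csv(
    io.StringIO(csv_text),
    index_col="CustomerID",
    dtype={"RFM_Segment": str},
)

print(naive["RFM_Segment"].tolist())
print(typed["RFM_Segment"].tolist())
```

Keeping the segment as a string avoids it being treated as a continuous value in charts and filters.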
