# E-Commerce Funnel Analysis — Final Report

**Objective:**  
Use transactional data to understand customer behavior across the purchase funnel,
identify revenue drivers, and translate findings into actionable business insights.

**Dataset:** UCI Online Retail Dataset (2010–2011)  
**Author:** Hiruy Kassa  
**Tools:** Python, pandas, matplotlib, seaborn

## Executive Summary

- Analyzed ~390k completed purchases after cleaning and removing returns
- Identified a sharp drop-off after the first purchase
- Found strong revenue concentration among repeat and high-value customers
- Observed significant product-level revenue concentration
- RFM segmentation highlights a small group of “Champions” driving outsized value

**Key Insight:**  
Because revenue is concentrated among repeat and high-value customers, improving retention is likely to yield a higher return than increasing acquisition volume alone.

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

cleaned = pd.read_csv("../data/processed/cleaned_purchases.csv")
customers = pd.read_csv("../data/processed/customer_metrics.csv")
products = pd.read_csv("../data/processed/product_performance.csv")
rfm = pd.read_csv("../data/processed/rfm_analysis.csv")

## Customer Funnel Overview

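Customers are grouped into one-time buyers (a single order), repeat buyers
(more than one order), and high-value buyers (total revenue above 1,000).
The funnel shows significant attrition after the first purchase,
with only a minority of customers becoming repeat or high-value buyers.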

In [6]:
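# The recorded run raised KeyError: 'order_count', so confirm that
# customer_metrics.csv exposes these column names before counting.
required = {"order_count", "total_revenue"}
missing = required - set(customers.columns)
if missing:
    raise KeyError(f"Missing {sorted(missing)}; available: {list(customers.columns)}")

# The groups overlap: a high-value customer is also one-time or repeat.
funnel_counts = {
    "One-Time Customers": (customers["order_count"] == 1).sum(),
    "Repeat Customers": (customers["order_count"] > 1).sum(),
    "High-Value Customers": (customers["total_revenue"] > 1000).sum()
}

pd.Series(funnel_counts).plot(kind="bar", title="Customer Funnel Breakdown")
plt.ylabel("Number of Customers")
plt.show()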

KeyError: 'order_count'
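
Raw counts are hard to compare across groups, so the breakdown can also be expressed as shares of the customer base. The sketch below is illustrative only: `funnel_shares` is a hypothetical helper, the demo rows are made up, and the column names `order_count` and `total_revenue` are taken from the cell above and may not match the actual file.

```python
import pandas as pd


def funnel_shares(customers: pd.DataFrame, high_value_threshold: float = 1000.0) -> pd.Series:
    """Return the share of customers in each (overlapping) funnel group."""
    n_customers = len(customers)
    if n_customers == 0:
        raise ValueError("customers is empty")
    counts = {
        "one_time": (customers["order_count"] == 1).sum(),
        "repeat": (customers["order_count"] > 1).sum(),
        "high_value": (customers["total_revenue"] > high_value_threshold).sum(),
    }
    return pd.Series(counts) / n_customers


# Made-up rows for illustration; not taken from the dataset.
demo_customers = pd.DataFrame({
    "order_count": [1, 1, 3, 5],
    "total_revenue": [50.0, 1200.0, 800.0, 4000.0],
})
shares = funnel_shares(demo_customers)
print(shares)
```

Because the high-value group is defined by revenue rather than by order count, the three shares are not expected to sum to 1.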

## Customer Value Distribution

Customer revenue is highly skewed, with a small subset contributing a
disproportionate share of total revenue.

In [None]:
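# Distribution of total revenue per customer; expect a long right tail.
sns.histplot(customers["total_revenue"], bins=50)
plt.title("Customer Revenue Distribution")
plt.xlabel("Total Revenue per Customer")
plt.ylabel("Number of Customers")
plt.show()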
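
A histogram shows the skew visually, but a single number can make the concentration easier to report. One possible measure is the share of total revenue contributed by the top 10% of customers, sketched below. `top_share` is a hypothetical helper and the revenue values are invented for illustration.

```python
import pandas as pd


def top_share(revenue: pd.Series, top_fraction: float = 0.10) -> float:
    """Share of total revenue contributed by the top `top_fraction` of customers."""
    ranked = revenue.sort_values(ascending=False)
    # Always include at least one customer, even for very small inputs.
    k = max(1, int(round(len(ranked) * top_fraction)))
    return float(ranked.iloc[:k].sum() / ranked.sum())


# Invented per-customer revenue; one large customer dominates.
demo_revenue = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 550], dtype=float)
share = top_share(demo_revenue)
print(f"Top 10% of customers contribute {share:.0%} of revenue")
```

On the real data this would be called as `top_share(customers["total_revenue"])`, assuming that column exists under that name.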

## RFM Segmentation

RFM analysis highlights a small group of high-frequency, high-value customers
(“Champions”) that represent a critical revenue segment.

In [None]:
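# Mean monetary value per segment; ascending sort puts the largest bar at the top.
rfm.groupby("rfm_segment")["monetary"].mean().sort_values().plot(
    kind="barh", title="Average Revenue by RFM Segment"
)
plt.xlabel("Average Monetary Value per Customer")
plt.show()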
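
The notebook loads precomputed RFM segments, so the scoring logic itself is not shown. For readers unfamiliar with the technique, the sketch below illustrates one common way to derive recency, frequency, and monetary scores from transaction lines using rank-based quantile bins. It is not necessarily the method used to produce `rfm_analysis.csv`. The column names (`customer_id`, `invoice_id`, `invoice_date`, `revenue`), the helper functions, the three-bin scoring, and the "all top scores" Champion rule are assumptions, and the rows are invented.

```python
import pandas as pd


def build_rfm(tx: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Aggregate transaction lines into recency, frequency, and monetary values."""
    rfm = tx.groupby("customer_id").agg(
        last_purchase=("invoice_date", "max"),
        frequency=("invoice_id", "nunique"),
        monetary=("revenue", "sum"),
    )
    # Recency: whole days between the last purchase and the snapshot date.
    rfm["recency"] = (snapshot - rfm["last_purchase"]).dt.days
    return rfm[["recency", "frequency", "monetary"]]


def quantile_score(values: pd.Series, higher_is_better: bool = True, bins: int = 3) -> pd.Series:
    """Score values from 1 (worst) to `bins` (best) using rank-based quantiles."""
    # Ranking first avoids duplicate bin edges when many values are tied.
    ranks = values.rank(method="first", ascending=higher_is_better)
    labels = list(range(1, bins + 1))
    return pd.qcut(ranks, q=bins, labels=labels).astype(int)


# Invented transactions for three customers.
tx = pd.DataFrame({
    "customer_id": ["A", "A", "A", "A", "B", "C", "C", "C"],
    "invoice_id": ["i1", "i2", "i3", "i4", "i5", "i6", "i7", "i8"],
    "invoice_date": pd.to_datetime([
        "2011-09-01", "2011-10-01", "2011-11-30", "2011-12-05",
        "2011-06-01",
        "2011-10-01", "2011-10-15", "2011-11-01",
    ]),
    "revenue": [100.0, 100.0, 100.0, 300.0, 50.0, 20.0, 30.0, 40.0],
})

rfm_demo = build_rfm(tx, snapshot=pd.Timestamp("2011-12-10"))
rfm_demo["r_score"] = quantile_score(rfm_demo["recency"], higher_is_better=False)
rfm_demo["f_score"] = quantile_score(rfm_demo["frequency"])
rfm_demo["m_score"] = quantile_score(rfm_demo["monetary"])

# Hypothetical rule: a Champion has the top score on all three dimensions.
top = 3
rfm_demo["is_champion"] = (
    (rfm_demo["r_score"] == top) & (rfm_demo["f_score"] == top) & (rfm_demo["m_score"] == top)
)
print(rfm_demo)
```

Recency is scored with `higher_is_better=False` because a smaller number of days since the last purchase indicates a more engaged customer.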

## Product Performance

Revenue is concentrated among a limited number of products,
creating potential dependency risk.

In [None]:
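# Ten products with the highest total revenue, largest first.
top_products = products.sort_values("total_revenue", ascending=False).head(10)

sns.barplot(x="total_revenue", y="product", data=top_products)
plt.title("Top 10 Products by Revenue")
plt.xlabel("Total Revenue")
plt.ylabel("Product")
plt.show()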
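
A top-10 chart shows which products lead, but not how dependent total revenue is on them. One way to quantify the concentration is to count how many products are needed to reach a given share of revenue, as sketched below. `products_needed_for_share` is a hypothetical helper, the 80% target is an arbitrary illustration, and the revenue figures are invented.

```python
import pandas as pd


def products_needed_for_share(revenue: pd.Series, target: float = 0.80) -> int:
    """Smallest number of top-selling products whose combined revenue reaches `target`."""
    ranked = revenue.sort_values(ascending=False)
    cumulative_share = ranked.cumsum() / ranked.sum()
    # Products strictly below the target, plus the one that crosses it.
    needed = int((cumulative_share < target).sum()) + 1
    # Clamp in case floating-point rounding leaves the final share just under 1.
    return min(needed, len(ranked))


# Invented product revenue: cumulative shares are 0.50, 0.75, 0.90, 0.96, 1.00.
demo_product_revenue = pd.Series(
    [500.0, 250.0, 150.0, 60.0, 40.0],
    index=["p1", "p2", "p3", "p4", "p5"],
)
needed = products_needed_for_share(demo_product_revenue)
print(f"{needed} of {len(demo_product_revenue)} products reach 80% of revenue")
```

On the real data this would be called as `products_needed_for_share(products["total_revenue"])`, assuming one row per product.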

## Returns Overview

Returns represent a small but important subset of transactions
and should be monitored for operational and quality insights.

In [None]:
# Load the separated return transactions and report (rows, columns).
returns = pd.read_csv("../data/processed/returns.csv")
returns.shape
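
The row count alone does not show where returns cluster. A per-product return rate is one way to turn returns into a quality signal, as sketched below. The sketch assumes a line-level table in which returns appear as negative quantities (as in the raw UCI data). The column names `product` and `quantity`, the helper `return_rate_by_product`, and the rows are assumptions and may not match `returns.csv`.

```python
import pandas as pd


def return_rate_by_product(lines: pd.DataFrame) -> pd.Series:
    """Returned units divided by sold units, per product, highest rate first."""
    is_return = lines["quantity"] < 0
    sold = lines.loc[~is_return].groupby("product")["quantity"].sum()
    returned = lines.loc[is_return].groupby("product")["quantity"].sum().abs()
    # Products without any returns get a rate of 0; products that only have
    # returns (no recorded sales) are dropped because no rate can be computed.
    returned = returned.reindex(sold.index, fill_value=0)
    return (returned / sold).sort_values(ascending=False)


# Invented transaction lines; negative quantities mark returns.
demo_lines = pd.DataFrame({
    "product": ["mug", "mug", "lamp", "lamp", "lamp", "clock"],
    "quantity": [10, -2, 5, 5, -1, 4],
})
rates = return_rate_by_product(demo_lines)
print(rates)
```

Products with unusually high rates would be natural candidates for a closer look at quality or fulfillment issues.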

## Business Recommendations

1. Prioritize retention strategies for first-time buyers
2. Invest in loyalty programs targeting repeat and RFM “Champion” customers
3. Diversify product offerings to reduce revenue concentration risk
4. Monitor returns as a signal of product or fulfillment issues

## Limitations & Next Steps

**Limitations**
- Historical dataset (2010–2011)
- No marketing channel or session-level data
- No customer demographics

**Next Steps**
- Predict repeat purchase probability using classification models
- Integrate time-based cohort analysis
- Build dashboards for ongoing monitoring

## Conclusion

This project demonstrates how clean data pipelines and structured analysis
can uncover meaningful insights from transactional data.
The findings reinforce the importance of customer retention and data-driven
decision making in e-commerce systems.