The company Amazing, an e-commerce marketplace, wishes to better understand the behaviors of its users in order to improve personalization and marketing performance.
It has a massive volume of user event data (several tens of GB/month), including the following interactions:
- view (product consultation)
- cart (add to cart)
- remove_from_cart
- purchase
Each event includes a timestamp, a user_id, a product_id, price, brand, category, etc.
The global objective is to build a behavioral segmentation model, capable of grouping users according to their navigation and purchasing behavior.
Build a client segmentation based on unsupervised learning methods to identify behavioral profiles actionable by marketing.
- Explore and clean massive event data.
- Build relevant user features.
- Reduce dimension via PCA.
- Apply a clustering method (K-Means).
- Interpret clusters and propose targeted marketing actions.
- Provide a reproducible pipeline.
- Data Processing: Pandas, DuckDB, Polars
- Visualization: Plotly, Matplotlib
- Machine Learning: scikit-learn (clustering, PCA)
- Environment: Google Colab, Kaggle
The data comes from raw monthly files named YYYY-MMM.csv.gz (October 2019 → April 2020).
Each file is roughly 2.5 GB compressed and 5–7 GB decompressed.
🏗️ Because of memory constraints, the initial analysis was conducted on a sample of October only, then on samples of 5M rows per month (October → April).
- event_time (timestamp)
- event_type (str): view, cart, remove_from_cart, purchase
- product_id, category_id, category_code
- brand, price
- user_id, user_session
Given the massive size of the files, the strategy adopted was:
- Reading by batch via DuckDB
- Conversion to Parquet for optimization
- Lazy loading for EDA
The following operations were performed:
De-duplication based on (event_time, user_id, product_id, event_type).
- category_code: ≈ 32% NA → kept, with a fallback via category_id
- brand: ≈ 14% NA → creation of an indicator brand_missing = True
- Critical columns (event_time, event_type, price, user_id): 0% NA
Extreme high price threshold set at the 0.999 quantile; values beyond it were filtered out.
- Derived columns: event_date, event_hour
- Logical cleaning (price < 0, outliers)
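As an illustration, the cleaning rules listed above could look like this in pandas. It is a sketch under assumptions: the function name `clean_events` is hypothetical, and the exact form of the category_id fallback (here, the id cast to a string) is a guess, since the report does not spell it out.

```python
import pandas as pd


def clean_events(df: pd.DataFrame, price_quantile: float = 0.999) -> pd.DataFrame:
    # De-duplicate on the key used in the report
    df = df.drop_duplicates(
        subset=["event_time", "user_id", "product_id", "event_type"]
    ).copy()
    # Flag missing brands instead of dropping the rows
    df["brand_missing"] = df["brand"].isna()
    # Assumed fallback: use category_id (as text) when category_code is missing
    df["category_code"] = df["category_code"].fillna(df["category_id"].astype(str))
    # Logical cleaning: negative prices, then prices above the chosen quantile
    df = df[df["price"] >= 0]
    upper = df["price"].quantile(price_quantile)
    df = df[df["price"] <= upper].copy()
    # Derived time columns
    ts = pd.to_datetime(df["event_time"], utc=True)
    df["event_date"] = ts.dt.date
    df["event_hour"] = ts.dt.hour
    return df
```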
"Views" represent the majority of events, followed by "carts".
"Purchases" remain rare → very wide funnel.
Insight: the global conversion rate (purchases / views) stays below 2 %.
October data only
Sampled data over all months
All months
All sampled months (this pattern is expected, since each sample only covers the beginning of its month)
- Very asymmetrical distribution (long tail).
- 95% of products < 500 €.
- Premium products create extremes (phones, laptops…).
Insight: the market is very polarized on high-tech.
- Very heterogeneous activity:
  - A minority of users generate the majority of events.
  - Many “one-time visitors”.
- Distribution of the number of events per user is very skewed.
Objective: understand global conversion.
conversion = totals["purchase"] / totals["view"] * 100
Global conversion rate: 1.82% for the month of October
Global conversion rate: 1.6% for all sampled months
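For reference, the purchases / views formula above can be wrapped in a small helper. This is a sketch: `global_conversion` is a hypothetical name, and `events` is assumed to be a DataFrame with an `event_type` column.

```python
import pandas as pd


def global_conversion(events: pd.DataFrame) -> float:
    """Return purchases / views as a percentage."""
    totals = events["event_type"].value_counts()
    # .get avoids a KeyError when a sample contains no purchase at all
    return totals.get("purchase", 0) / totals["view"] * 100
```

Note that this ratio counts events, not users or sessions, so a user who views the same product ten times lowers the rate.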
Objective: spot categories where the purchase/view ratio is abnormally low or high.
For the month of October only
For all months
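A per-category version of the same ratio could be computed as below. This is only a sketch: the function name and the `min_views` filter (used to hide categories with too few views to be meaningful) are assumptions, not part of the original analysis.

```python
import pandas as pd


def conversion_by_category(events: pd.DataFrame, min_views: int = 1) -> pd.DataFrame:
    # Count events per category and type (rows with a missing category are ignored)
    counts = pd.crosstab(events["category_code"], events["event_type"])
    # Keep only the two columns needed, creating them if a type never occurs
    counts = counts.reindex(columns=["view", "purchase"], fill_value=0)
    counts = counts[counts["view"] >= min_views].copy()
    counts["conversion_pct"] = counts["purchase"] / counts["view"] * 100
    # Highest ratios first; the lowest ratios are at the bottom
    return counts.sort_values("conversion_pct", ascending=False)
```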
For each user, the following features were calculated:
- total_events
- views, carts, purchases
- n_sessions
- days_active
- recency_days
- total_spent
- avg_price_purchase
- purchase_rate = purchases / views
- cart_rate
- remove_rate
- cart_to_purchase = purchases / carts
- night_share
- n_categories_viewed, n_brands_viewed
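A subset of these features can be derived with pandas as sketched below. Several choices are assumptions made for the example: the function name `build_user_features`, measuring recency from the most recent event in the data, and replacing ratios with a zero denominator by 0. Features whose definition is not given in the report (cart_rate, remove_rate, night_share) are left out.

```python
import numpy as np
import pandas as pd


def build_user_features(events: pd.DataFrame) -> pd.DataFrame:
    ev = events.copy()
    ev["event_time"] = pd.to_datetime(ev["event_time"], utc=True)
    ev["event_date"] = ev["event_time"].dt.date
    ref_date = ev["event_time"].max()  # assumption: recency relative to the last event

    # Event counts per user and type
    counts = pd.crosstab(ev["user_id"], ev["event_type"])
    counts = counts.reindex(columns=["view", "cart", "purchase"], fill_value=0)
    counts.columns = ["views", "carts", "purchases"]

    grouped = ev.groupby("user_id")
    spent = ev[ev["event_type"] == "purchase"].groupby("user_id")["price"].sum()
    feats = counts.assign(
        total_events=grouped.size(),
        n_sessions=grouped["user_session"].nunique(),
        days_active=grouped["event_date"].nunique(),
        recency_days=(ref_date - grouped["event_time"].max()).dt.days,
        total_spent=spent,  # NaN for users without any purchase
    )
    feats["total_spent"] = feats["total_spent"].fillna(0.0)
    # Ratios: a zero denominator yields NaN, replaced by 0 here (assumption)
    feats["purchase_rate"] = (
        feats["purchases"] / feats["views"].replace(0, np.nan)
    ).fillna(0.0)
    feats["cart_to_purchase"] = (
        feats["purchases"] / feats["carts"].replace(0, np.nan)
    ).fillna(0.0)
    return feats
```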
Variables were normalized via StandardScaler.
- The first 6 components explain > 80 % of the variance.
- PCA used as space for clustering (X_pca).
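The scaling and projection steps can be sketched with scikit-learn as follows. The helper name `scale_and_project` and the `variance_target` parameter are hypothetical; the 80 % figure is taken from the report, and `X` stands for the numeric user-feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def scale_and_project(X, variance_target: float = 0.80):
    """Standardize X, fit a PCA, and keep enough components to reach the target."""
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA().fit(X_scaled)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # Index of the first component at which the cumulative variance reaches the target
    n_components = int(np.searchsorted(cumulative, variance_target) + 1)
    n_components = min(n_components, len(cumulative))
    X_pca = pca.transform(X_scaled)[:, :n_components]
    return X_pca, cumulative
```

Standardizing first matters here because features such as total_spent and purchase_rate live on very different scales, and PCA would otherwise be dominated by the largest one.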
The elbow method shows a clear inflection point around k = 4 or 5.
→ Choice retained: k = 4 (good compromise between granularity and readability).
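The elbow curve itself comes from fitting K-Means for several values of k and recording the inertia (within-cluster sum of squares). A minimal sketch, in which the function name, the range of k, and the `random_state` value are illustrative assumptions:

```python
from sklearn.cluster import KMeans


def elbow_inertias(X_pca, k_values=range(2, 11), random_state=42):
    """Fit K-Means for each k and return {k: inertia}."""
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_pca)
        inertias[k] = model.inertia_
    return inertias
```

Plotting the returned values against k shows where the curve flattens; the inertia alone always decreases with k, so the choice remains a judgment call.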
- Training with k = 4 on the first 10 PCA components.
- Separation visualized on a 2D PCA graph.
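The training step and the per-cluster averages used for interpretation could be written as below. This is a sketch: `cluster_users` and `cluster_summary` are hypothetical helper names, and `X_pca` is assumed to hold the retained PCA components, row-aligned with the `features` DataFrame.

```python
import pandas as pd
from sklearn.cluster import KMeans


def cluster_users(features: pd.DataFrame, X_pca, k: int = 4, random_state: int = 42) -> pd.DataFrame:
    """Fit K-Means on the PCA space and attach the labels to the user features."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X_pca)
    return features.assign(cluster=labels)


def cluster_summary(clustered: pd.DataFrame) -> pd.DataFrame:
    """Mean of each numeric feature per cluster, plus the cluster size."""
    summary = clustered.groupby("cluster").mean(numeric_only=True)
    summary["n_users"] = clustered.groupby("cluster").size()
    return summary
```

Cluster numbers are arbitrary and can change with the random seed, so profiles should be matched on their feature averages rather than on the label value.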
Thanks to boxplots (recency_days, purchases, total_spent, purchase_rate) and cluster_summary averages, the following profiles emerged:
Cluster 2 (VIP / High-Rollers): this is by far the most valuable segment in terms of revenue, but it is atypical.
- Visual Proof:
  - total_spent: far above all other groups. The box sits very high (between 500k and 1M+), while the others are flattened at 0 on the same scale.
  - purchases: surprisingly low (median close to 0, max around 30).
- Interpretation: These are clients who make few purchases, but for astronomical amounts. They probably buy very high-end products or make massive B2B orders.
- Marketing Action: Concierge service, exclusive offers, VIP treatment. Do not lose them.
Cluster 3 (Buyers): this is the only group with a "healthy" and regular purchasing behavior in terms of conversion.
- Visual Proof:
  - purchase_rate: the only cluster with significant activity here (median ~25%, going up to 100%). The others are at 0.
  - recency_days: the median is quite high (~125 days). Careful, they are starting to "cool down".
  - total_spent: low. They buy, but small amounts.
- Interpretation: These are clients who convert well when they come, but they spend little per order and haven't come recently.
- Marketing Action: Win-back campaign (they are drifting away) + up-selling (steer them toward more expensive products, since they already trust the site).
Cluster 0 (Active Visitors): this group is very active on the site but does not convert.
- Visual Proof:
  - recency_days: the lowest of all groups (median ~50 days). They visited recently.
  - purchase_rate: almost null (flattened at 0).
  - purchases: many outliers (black dots) reach high values. Some users buy a lot, but the majority (the box) is at 0.
- Interpretation: They come often, look a lot, but don't buy. There might be a price or shipping cost issue for this segment.
- Marketing Action: Retargeting. They need a "nudge" (promo code, free shipping) to trigger the first purchase.
Cluster 1 (Inactive): this group likely represents a large part of the database that no longer has any immediate value.
- Visual Proof:
  - recency_days: the box is spread upwards (up to 200+ days).
  - total_spent & purchases: flat at zero.
- Interpretation: These are old visitors who never bought or don't buy anymore.
- Marketing Action: Do not spend budget on them. Send an automated "last chance" email campaign, and if they don't react, clean the base.
| Cluster | Profile Name | Key Characteristic | Recommended Strategy |
|---|---|---|---|
| 2 | VIP / High-Rollers | Massive spending, low volume | Premium Loyalty (Care) |
| 3 | Buyers | Good conversion rate, low baskets | Cross-sell / Upsell (Increase basket) |
| 0 | Active Visitors | Recent visits, no purchase | Conversion (Trigger promotions) |
| 1 | Inactive | Old, no spending | Cleaning / Automation (Low priority) |
This project made it possible to:
- understand user behaviors at scale,
- identify 4 relevant segments,
- propose concrete marketing levers.
- Launch a VIP program for Premium buyers.
- Set up personalized notifications for efficient buyers.
- Segment newsletters according to clusters.
- Conduct a “winback” campaign targeting the dormant cluster.
- Optimize the funnel for passive explorers (cluster 0, the active visitors who do not buy).
- Scaling to multi-month (via DuckDB + Polars Lazy).
- Addition of product features (preferred category, seasonality).
- Test of more robust methods: HDBSCAN, Gaussian Mixture Models.
- Industrialization via API + dashboard.
- Notebook: 01_data_sampling.ipynb
- Notebook: 02_data_clean_eda.ipynb
- Notebook: 03_modelisation.ipynb, 03_modelisation_oct.ipynb
- Dataset: users_features.parquet, users_features_oct.parquet
- Clustering results: users_clustered.parquet
- Report: this document
- Drive link: https://drive.google.com/drive/folders/1fOeQKvOSnPbKJKWEVjhw5D6_m7Z93KjE?usp=sharing
.png)

.png)
.png)
%201.png)
.png)
.png)

%201.png)

%202.png)
%201.png)






