pinsdev24/client_segmentation


Amazing Client Segmentation

1. Context

The company Amazing, an e-commerce marketplace, wishes to better understand the behaviors of its users in order to improve personalization and marketing performance.

It has a massive volume of user event data (several tens of GB/month), including the following interactions:

  • view (product consultation)
  • cart (add to cart)
  • remove_from_cart
  • purchase

Each event includes a timestamp, a user_id, a product_id, price, brand, category, etc.

The global objective is to build a behavioral segmentation model, capable of grouping users according to their navigation and purchasing behavior.


2. Project Objectives

Main Objective

Build a client segmentation with unsupervised learning methods to identify behavioral profiles that marketing can act on.

Operational Objectives

  • Explore and clean massive event data.
  • Build relevant user features.
  • Reduce dimension via PCA.
  • Apply a clustering method (K-Means).
  • Interpret clusters and propose targeted marketing actions.
  • Provide a reproducible pipeline.

Technologies Used

  • Data Processing: Pandas, DuckDB, Polars
  • Visualization: Plotly, Matplotlib
  • Machine Learning: scikit-learn (clustering, PCA)
  • Environment: Google Colab, Kaggle

3. Data

The data comes from raw monthly files YYYY-MMM.csv.gz (October 2019 → April 2020).

Each file is about 2.5 GB compressed and 5–7 GB decompressed.

🏗️ Because of memory constraints, the initial analysis was conducted on a sample of October only, then on samples of 5M rows per month (October → April).

Main Variables

  • event_time (timestamp)
  • event_type (str): view, cart, remove_from_cart, purchase
  • product_id, category_id, category_code
  • brand, price
  • user_id, user_session

4. Data Preparation & Cleaning

4.1 Scalable Loading

Given the massive size of the files, the strategy adopted was:

  • Reading by batch via DuckDB
  • Conversion to Parquet for optimization
  • Lazy loading for EDA
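The batch-reading idea can be sketched as follows. This is only an illustration: it uses pandas chunked reading as a stand-in for the DuckDB batch reads described above, and the function name `count_event_types`, the chunk size, and the tiny in-memory CSV are assumptions made for the example, not the project's actual code.

```python
import io
from collections import Counter

import pandas as pd


def count_event_types(source, chunksize=2):
    """Stream a CSV in fixed-size chunks and tally event types."""
    totals = Counter()
    # With chunksize set, read_csv returns an iterator of DataFrames,
    # so only one chunk is held in memory at a time.
    for chunk in pd.read_csv(source, usecols=["event_type"], chunksize=chunksize):
        totals.update(chunk["event_type"].value_counts().to_dict())
    return dict(totals)


# Hypothetical five-row sample standing in for a monthly file.
sample_csv = io.StringIO(
    "event_time,event_type,user_id\n"
    "2019-10-01 00:00:00,view,1\n"
    "2019-10-01 00:00:05,view,2\n"
    "2019-10-01 00:01:00,cart,1\n"
    "2019-10-01 00:02:00,purchase,1\n"
    "2019-10-01 00:03:00,view,3\n"
)
event_totals = count_event_types(sample_csv)
```

The same function should also accept a path to a compressed monthly file, since pandas infers gzip compression from the `.gz` extension by default.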

4.2 Cleaning

The following operations were performed:

✔ Duplicate Removal

De-duplication based on:

(event_time, user_id, product_id, event_type)
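A minimal sketch of this de-duplication with pandas, assuming the events are already loaded in a DataFrame named `events` (the sample rows are invented for illustration):

```python
import pandas as pd

# Key columns taken from the de-duplication rule above.
DEDUP_KEYS = ["event_time", "user_id", "product_id", "event_type"]

# Hypothetical sample: the first two rows are exact duplicates on the key.
events = pd.DataFrame(
    {
        "event_time": ["2019-10-01 00:00:00", "2019-10-01 00:00:00", "2019-10-01 00:00:07"],
        "user_id": [1, 1, 1],
        "product_id": [100, 100, 100],
        "event_type": ["view", "view", "view"],
        "price": [9.99, 9.99, 9.99],
    }
)

# Keep the first occurrence of each key combination.
deduped = events.drop_duplicates(subset=DEDUP_KEYS, keep="first").reset_index(drop=True)
```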

✔ Missing Value Management

  • category_code: ≈ 32% NA → kept, with a fallback to category_id
  • brand: ≈ 14% NA → creation of an indicator brand_missing = True
  • Critical columns (event_time, event_type, price, user_id): 0% NA
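The missing-value handling above could look like the sketch below. The column name `category_label` and the `id_` prefix used for the fallback are assumptions for the example; the source only states that category_id serves as a fallback.

```python
import pandas as pd

# Hypothetical sample with missing category_code and brand values.
events = pd.DataFrame(
    {
        "category_id": [101, 101, 202],
        "category_code": ["electronics.smartphone", None, None],
        "brand": ["samsung", None, "apple"],
    }
)

# Flag missing brands instead of dropping the rows.
events["brand_missing"] = events["brand"].isna()

# Keep category_code where present, otherwise fall back to a label
# derived from category_id (the "id_" prefix is an assumed convention).
events["category_label"] = events["category_code"].fillna(
    "id_" + events["category_id"].astype(str)
)
```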

✔ Price Outlier Treatment

The upper threshold was set at the 0.999 quantile of price.

Values above this threshold were filtered out.
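As a rough sketch of this filter, assuming a DataFrame `events` with a `price` column (the synthetic prices are invented for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical prices: 1.0 to 9999.0 plus one extreme value.
events = pd.DataFrame({"price": np.append(np.arange(1.0, 10_000.0), 1_000_000.0)})

# Upper bound at the 0.999 quantile, as described above.
upper = events["price"].quantile(0.999)

# Drop negative prices and values above the upper bound.
filtered = events[(events["price"] >= 0) & (events["price"] <= upper)]
```

On this sample the ten largest values, including the extreme one, are removed.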

✔ Initial Feature Engineering

  • event_date, event_hour
  • Logical cleaning (removal of rows with price < 0 and of price outliers)
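The two time features can be derived from event_time along these lines (a minimal sketch; treating the timestamps as UTC is an assumption):

```python
import pandas as pd

# Hypothetical sample of raw timestamps.
events = pd.DataFrame({"event_time": ["2019-10-01 00:00:00", "2019-10-01 18:30:12"]})

# Parse the strings once, then derive the calendar date and the hour of day.
events["event_time"] = pd.to_datetime(events["event_time"], utc=True)
events["event_date"] = events["event_time"].dt.date
events["event_hour"] = events["event_time"].dt.hour
```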

5. EDA — Data Exploration

5.1 Event Type Distribution

"Views" represent the majority of events, followed by "carts".

"Purchases" remain rare → very wide funnel.

Insight: global conversion rate below 2 %.

Figure: event type distribution (October data only)

Figure: event type distribution (sampled data over all months)


5.2 Temporal Analysis

Hourly Activity

  • Strong activity between 4am and 6pm

  • Peaks in the evening (after-work shopping)

Figure: hourly activity (month of October)

Figure: hourly activity (all months)

Daily Activity

  • Regular curve, no major behavioral break over the month.

Figure: daily activity (month of October only)

Figure: daily activity (all sampled months). The shape of this curve is expected, because only data from the beginning of each month was sampled.


5.3 Price Analysis

  • Very asymmetrical distribution (long tail).
  • 95% of products < 500 €.
  • Premium products create extremes (phones, laptops…).

5.4 Dominant Categories & Brands

Most Visited Categories:

  • electronics.smartphone

  • electronics.clock

  • computer.notebook

  • electronics.video.tv


Most Purchased Categories:

  • Smartphones

  • headphone


Dominant Brands:

  • Samsung, Apple, Xiaomi, Huawei


Insight: the market is very polarized on high-tech.


5.5 User Behavioral Analysis

  • Very heterogeneous activity:

    • A minority of users generate the majority of events.
    • Many “one-time visitors”.
  • The distribution of the number of events per user is very skewed.

Simplified Funnel

Objective: understand global conversion.

conversion = totals["purchase"] / totals["view"] * 100

Global conversion rate: 1.82% for the month of October

Global conversion rate: 1.6% for the month of October
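For context, the `totals` object used in the conversion formula can be obtained from event counts. The sketch below assumes a DataFrame `events` with an `event_type` column; the counts are invented and do not reproduce the reported rates.

```python
import pandas as pd

# Hypothetical events: 50 views, 5 carts, 1 purchase.
events = pd.DataFrame({"event_type": ["view"] * 50 + ["cart"] * 5 + ["purchase"] * 1})

# Number of events per type, indexed by event_type.
totals = events["event_type"].value_counts()

# Purchases per 100 views, matching the formula above.
conversion = totals["purchase"] / totals["view"] * 100

# Intermediate funnel step: carts per 100 views.
view_to_cart = totals["cart"] / totals["view"] * 100
```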

Heatmap Categories vs Event Types

Objective: spot categories where the purchase/view ratio is abnormally low or high.

Figure: heatmap of categories vs event types (month of October only)

Figure: heatmap of categories vs event types (all months)


6. User Feature Construction

For each user, the following features were calculated:

Activity Variables

  • total_events
  • views, carts, purchases
  • n_sessions
  • days_active
  • recency_days

Monetary Variables

  • total_spent
  • avg_price_purchase

Derived Variables

  • purchase_rate = purchases / views
  • cart_rate
  • remove_rate
  • cart_to_purchase = purchases / carts
  • night_share
  • n_categories_viewed, n_brands_viewed
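A subset of these features can be built per user with a few group-by aggregations. This is a sketch under assumptions: recency_days is taken here as the number of days between a user's last event and the latest event in the data, rates with a zero denominator are set to 0, and the sample events are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical events for two users.
events = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 1, 2, 2],
        "user_session": ["a", "a", "a", "b", "c", "c"],
        "event_type": ["view", "view", "cart", "purchase", "view", "view"],
        "price": [100.0, 50.0, 50.0, 50.0, 20.0, 30.0],
        "event_time": pd.to_datetime(
            ["2019-10-01 10:00", "2019-10-01 10:05", "2019-10-01 10:06",
             "2019-10-03 09:00", "2019-10-01 23:00", "2019-10-01 23:10"]
        ),
    }
)

# Event counts per user and type (views, carts, purchases).
counts = (
    pd.crosstab(events["user_id"], events["event_type"])
    .reindex(columns=["view", "cart", "purchase"], fill_value=0)
    .rename(columns={"view": "views", "cart": "carts", "purchase": "purchases"})
)

# Activity variables.
activity = events.groupby("user_id").agg(
    total_events=("event_type", "size"),
    n_sessions=("user_session", "nunique"),
    days_active=("event_time", lambda s: s.dt.normalize().nunique()),
    last_event=("event_time", "max"),
)

# Monetary variables, computed on purchase events only.
monetary = (
    events[events["event_type"] == "purchase"]
    .groupby("user_id")["price"]
    .agg(total_spent="sum", avg_price_purchase="mean")
)

features = counts.join(activity).join(monetary)
money_cols = ["total_spent", "avg_price_purchase"]
features[money_cols] = features[money_cols].fillna(0.0)

# Recency relative to the latest event in the data (assumed definition).
reference = events["event_time"].max()
features["recency_days"] = (reference - features["last_event"]).dt.days
features = features.drop(columns="last_event")

# Ratios; a zero denominator yields 0 instead of a division error.
features["purchase_rate"] = (
    features["purchases"] / features["views"].replace(0, np.nan)
).fillna(0.0)
features["cart_to_purchase"] = (
    features["purchases"] / features["carts"].replace(0, np.nan)
).fillna(0.0)
```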

7. Modeling

7.1 Standardization

Variables were standardized (zero mean, unit variance) via StandardScaler.
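A minimal illustration of this step with scikit-learn, on an invented feature matrix whose columns have very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: rows are users, columns are features.
X = np.array([[1.0, 10.0], [2.0, 200.0], [3.0, 3000.0], [4.0, 40000.0]])

# Each column is centered on 0 and rescaled to unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

Without this step, features with large ranges such as total_spent would dominate the distances used by PCA and K-Means.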

7.2 PCA — Dimensionality Reduction

  • The first 6 components explain > 80 % of the variance.
  • The PCA-projected data (X_pca) is used as the input space for clustering.
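One way to pick the number of components from an explained-variance target is sketched below. The data is synthetic (12 features driven by 3 latent factors), so the resulting count differs from the 6 components reported above; the 80 % target is the only value taken from the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the user feature matrix.
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 12)) + 0.1 * rng.normal(size=(500, 12))

X_scaled = StandardScaler().fit_transform(X)

# Fit a full PCA and accumulate the explained variance ratios.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 80 %.
n_components = int(np.searchsorted(cumulative, 0.80) + 1)

# Project the data onto the retained components.
X_pca = PCA(n_components=n_components).fit_transform(X_scaled)
```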


7.3 Determination of Optimal Number of Clusters

The elbow method shows an inflection point around k = 4 or 5.

→ The choice retained: k = 4 (good compromise between granularity and readability).
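The elbow curve comes from fitting K-Means for a range of k and recording the inertia (within-cluster sum of squares). A sketch on synthetic blobs, which stand in for the PCA-projected users:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for X_pca, generated with 4 well-separated groups.
X_pca, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=0)

# Inertia for k = 1..8; the elbow is where the decrease flattens out.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca)
    inertias[k] = model.inertia_
```

Plotting `inertias` against k gives the elbow chart; the choice of k still involves judgment when the bend is spread over two neighboring values.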


7.4 Clustering (K-Means)

  • Training with k = 4 on the first 10 PCA components.
  • Separation visualized on 2D PCA graph.
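The training step can be sketched as follows. The input is synthetic (blobs with 12 features standing in for the standardized user features), and the `random_state` and `n_init` values are assumptions; only k = 4 and the 10 components come from the text.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardized feature matrix.
X_scaled, _ = make_blobs(n_samples=400, centers=4, n_features=12, random_state=1)

# Keep the first 10 principal components as the clustering space.
X_pca = PCA(n_components=10).fit_transform(X_scaled)

# Fit K-Means with k = 4 and assign a cluster label to each user.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

# The first two components are enough for a 2D scatter plot colored by label.
coords_2d = X_pca[:, :2]
```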



8. Analysis, Interpretation of Clusters and Business Discussion

Boxplots of recency_days, purchases, total_spent and purchase_rate per cluster, together with the cluster_summary averages, were used to characterize each cluster. The resulting profiles are interpreted below.
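A table of per-cluster averages such as cluster_summary can be produced with a group-by. The sketch assumes a DataFrame `features` with a `cluster` column; all values are invented and do not reproduce the project's results.

```python
import pandas as pd

# Hypothetical per-user features with an assigned cluster label.
features = pd.DataFrame(
    {
        "cluster": [0, 0, 1, 1, 2, 3],
        "recency_days": [10, 20, 150, 250, 30, 120],
        "purchases": [0, 0, 0, 0, 3, 2],
        "total_spent": [0.0, 0.0, 0.0, 0.0, 9000.0, 80.0],
        "purchase_rate": [0.0, 0.0, 0.0, 0.0, 0.1, 0.25],
    }
)

profile_columns = ["recency_days", "purchases", "total_spent", "purchase_rate"]

# Mean of each profiling variable per cluster, plus the cluster size.
cluster_summary = features.groupby("cluster")[profile_columns].mean()
cluster_summary["n_users"] = features.groupby("cluster").size()
```

Because these variables are heavily skewed, comparing medians next to the means can help when reading the profiles.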



Interpretation

1. Cluster 2 : "The VIPs (Whales)" — Absolute Priority

This is by far the most valuable segment in terms of turnover, but it is atypical.

  • Visual Proof:
    • total_spent: Far above all other groups. The box sits very high (between 500k and 1M+), while the other clusters are compressed near 0 on the same scale.
    • purchases : Surprisingly low (median close to 0, max around 30).
  • Interpretation: These are clients who make few purchases, but for astronomical amounts. They probably buy very high-end products or make massive B2B orders.
  • Marketing Action: Concierge service, exclusive offers, VIP treatment. Do not lose them.

2. Cluster 3 : "Confirmed Buyers" — The Core Target

This is the only group that has a "healthy" and regular purchasing behavior in terms of conversion.

  • Visual Proof:
    • purchase_rate : It is the only cluster with significant activity here (median ~25%, going up to 100%). Others are at 0.
    • recency_days : The median is quite high (~125 days). Careful, they are starting to "cool down".
    • total_spent : Low. They buy, but small amounts.
  • Interpretation: These are clients who convert well when they come, but they spend little per order and haven't come recently.
  • Marketing Action: Win-back campaign (they are drifting away) + Up-selling (encourage them to buy more expensive products, since they already trust the platform).

3. Cluster 0 : "Window Shoppers (Active Curious)" — Untapped Potential

This group is very active on the site but does not take action.

  • Visual Proof:
    • recency_days : Very low (median ~50 days). They came very recently.
    • purchase_rate : Almost null (crushed at 0).
    • purchases : There are a lot of outliers (black dots) going high. Some buy a lot, but the majority (the box) is at 0.
  • Interpretation: They come often, look a lot, but don't buy. There might be a price or shipping cost issue for this segment.
  • Marketing Action: Retargeting. They need a "nudge" (promo code, free shipping) to trigger the first purchase.

4. Cluster 1 : "Inactive / Churners" — Background Noise

This group likely represents a large part of the customer base that no longer has any immediate value.

  • Visual Proof:
    • recency_days : The box is spread upwards (up to 200+ days).
    • total_spent & purchases : Flat at zero.
  • Interpretation: These are old visitors who never bought or don't buy anymore.
  • Marketing Action: Do not spend budget on them. Send an automated "last chance" email campaign, and if they don't react, clean the base.

Strategic Summary

| Cluster | Profile Name | Key Characteristic | Recommended Strategy |
|---|---|---|---|
| 2 | VIP / High-Rollers | Massive spending, low volume | Premium Loyalty (Care) |
| 3 | Buyers | Good conversion rate, low baskets | Cross-sell / Upsell (Increase basket) |
| 0 | Active Visitors | Recent visits, no purchase | Conversion (Trigger promotions) |
| 1 | Inactive | Old, no spending | Cleaning / Automation (Low priority) |

9. Conclusion & Recommendations

This project made it possible to:

  • understand user behaviors at scale,
  • identify 4 relevant segments,
  • propose concrete marketing levers.

Priority Recommendations:

  1. Launch a VIP program for Premium buyers.
  2. Set up personalized notifications for efficient buyers.
  3. Segment newsletters according to clusters.
  4. Conduct a “winback” campaign targeting the dormant cluster.
  5. Optimize the funnel for passive explorers (cluster 0).

10. Areas for Improvement

  • Scaling to multi-month (via DuckDB + Polars Lazy).
  • Addition of product features (preferred category, seasonality).
  • Test of more robust methods: HDBSCAN, Gaussian Mixture Models.
  • Industrialization via API + dashboard.

📁 Deliverables Provided

