gregalexan/analytics

Business Analytics and Personalization Technologies

🎯 Overview

This project performs basket segmentation and clustering analysis on POS (Point of Sale) data. It:

  1. Processes raw POS data and creates custom product categories
  2. Performs K-Means clustering based on quantity × value metrics
  3. Analyzes cluster profiles and generates comprehensive visualizations and reports

The pipeline is fully automated and generates insights about customer purchasing patterns, basket composition, and category preferences across different segments.


📁 Project Structure

Analytics/
├── README.md                          # This file
├── clustering.py                      # Main entry point
│
├── data/                              # Input data folder
│   ├── README.md                      # Data folder documentation
│   ├── POS_DATA_BAPR_2024-2025_updated.xlsx  # ⭐ Required input file
│   ├── FINAL_POS_Data.xlsx            # Processed data (auto-generated)
│   ├── FINAL_POS_Data_For_Input.xlsx  # With custom categories (auto-generated)
│   └── [other CSV/Excel files]        # Category and reference data
│
├── scripts/                           # Python scripts
│   ├── add_second_custom_category.py  # Creates second custom categories
│   ├── kmeans_Quantity_Value.py       # K-Means clustering algorithm
│   ├── analysis_presence.py           # Cluster profile analysis
│   └── [other analysis scripts]
│
├── outputs/                           # Analysis results
│   ├── README.md                      # Outputs documentation
│   ├── *.csv                          # Data tables (top categories, lift, etc.)
│   └── *.png                          # Visualizations (heatmaps, bar charts)
│
├── outputs/                           # Alternative output version
└── custom_category_analysis/          # Custom category analysis

✅ Prerequisites

  • Python 3.8+
  • Required Libraries:
    pandas
    numpy
    scikit-learn
    matplotlib
    openpyxl
    

Installation

Install dependencies:

pip install pandas numpy scikit-learn matplotlib openpyxl

🚀 Usage

Quick Start

  1. Place the input file in the data/ folder:

    POS_DATA_BAPR_2024-2025_updated.xlsx
    
  2. Run the clustering pipeline:

    python clustering.py

  3. Check the results in the outputs/ folder

Step-by-Step Execution

You can also run individual steps:

# Step 1: Create second-level categories
python scripts/add_second_custom_category.py

# Step 2: Perform K-Means clustering
python scripts/kmeans_Quantity_Value.py

# Step 3: Generate analysis and visualizations
python scripts/analysis_presence.py

🔄 Pipeline Overview

Step 1: Custom Category Processing

  • Input: POS_DATA_BAPR_2024-2025_updated.xlsx
  • Process: Maps first-level categories to second-level groupings (e.g., "ΠΡΩΙΝΟ" + "ΑΥΓΑ" → "ΠΡΩΙΝΟ - ΑΥΓΑ")
  • Output: FINAL_POS_Data_For_Input.xlsx

Step 2: K-Means Clustering

  • Input: FINAL_POS_Data_For_Input.xlsx
  • Metrics: Quantity × Value per category per basket
  • Features: "qv_share" (share of quantity×value per basket, sums to 1)
  • Clustering: Automatic K selection (2-10 clusters) via silhouette score when FORCE_K is set to None; otherwise the fixed FORCE_K value is used (see Configuration)
  • Output: basket_segments.xlsx with:
    • basket_matrix sheet (basket × category values)
    • basket_clusters sheet (basket → cluster assignment)
    • cluster_profiles sheet (cluster × category composition %)
    • k_search sheet (K evaluation metrics)
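
As a rough illustration of the "qv_share" feature described above, the sketch below builds a basket × category matrix from line items and normalizes each row so that it sums to 1. The column names are taken from the Troubleshooting section; the sample rows and the helper name `build_qv_share` are made up for illustration. It also assumes that quantity × value is computed per line item and then summed, which may differ from what scripts/kmeans_Quantity_Value.py actually does.

```python
import pandas as pd

def build_qv_share(df, value_col="Value_"):
    """Return a basket x category matrix whose rows sum to 1 (hypothetical helper)."""
    # Assumption: quantity x value is computed per line item, then summed per category
    qv = df.assign(qv=df["Quantity"] * df[value_col])
    matrix = qv.pivot_table(index="Basket_ID", columns="Custom Category",
                            values="qv", aggfunc="sum", fill_value=0.0)
    # Drop baskets with a zero total so the normalization never divides by zero
    matrix = matrix[matrix.sum(axis=1) > 0]
    return matrix.div(matrix.sum(axis=1), axis=0)

# Made-up line items: basket 1 buys from categories A and B, basket 2 only from A
sample = pd.DataFrame({
    "Basket_ID": [1, 1, 2],
    "Quantity": [2, 1, 3],
    "Value_": [1.5, 3.0, 2.0],
    "Custom Category": ["A", "B", "A"],
})
shares = build_qv_share(sample)
```

In this toy input, basket 1 splits evenly between A and B, while basket 2 is entirely A. Normalizing per basket means clustering reflects what a basket is made of rather than how large it is.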

Step 3: Cluster Analysis

  • Calculates:
    • Top 10 categories per cluster (by composition %)
    • Presence percentages (% of baskets containing each category)
    • Lift values (discriminative power)
  • Generates: CSV reports and visualizations (heatmaps, bar charts)
  • Output: Multiple files in outputs/ folder

📊 Output Files

Data Files (CSV)

  • top_categories_per_cluster.csv - Top 10 categories by value for each cluster
  • discriminative_categories_per_cluster.csv - Top 10 discriminative categories (with lift scores)
  • presence_percentages_per_cluster.csv - % of baskets containing each category per cluster
  • long_table_all_values.csv - Long-format table of all values

Visualizations (PNG)

  • heatmap_values.png - Heatmap of all cluster × category values
  • bar_topN_cluster_<id>.png - Top 10 categories by composition for each cluster
  • bar_topN_presence_cluster_<id>.png - Top 10 categories by basket presence
  • bar_topN_lift_cluster_<id>.png - Top 10 discriminative categories (by lift)
  • comparison_cluster_<id>.png - Composition vs Presence comparison

⚙️ Configuration

Key parameters can be adjusted in scripts/kmeans_Quantity_Value.py:

# Number of clusters (set to None for auto-search)
FORCE_K = 9

# Feature calculation method
FEATURE_SET = "qv_share"  # or "qv_sum"

# Value metric to use
VALUE_COL = "Value_"  # Can be "Price", "Average_Price", "Price_Adjusted"

# K-Means search range (for auto-search)
K_MIN, K_MAX = 2, 10

# Random seed for reproducibility
RANDOM_STATE = 42
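
To show how these parameters could interact, here is a minimal sketch of a silhouette-based K search using scikit-learn. The function name `choose_k` and the toy data are assumptions for illustration only; the actual script may structure its search differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

FORCE_K = None          # None triggers the auto-search
K_MIN, K_MAX = 2, 10
RANDOM_STATE = 42

def choose_k(X, force_k=FORCE_K, k_min=K_MIN, k_max=K_MAX, seed=RANDOM_STATE):
    """Return force_k if set, else the k in [k_min, k_max] with the highest silhouette."""
    if force_k is not None:
        return force_k
    scores = {}
    # silhouette_score requires at most n_samples - 1 clusters
    for k in range(k_min, min(k_max, len(X) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)

# Toy data: three well-separated blobs in 2-D (not real basket features)
rng = np.random.default_rng(RANDOM_STATE)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])
best_k = choose_k(X)
```

On the toy blobs the search settles on three clusters. Passing `force_k=9` skips the search entirely, mirroring the `FORCE_K = 9` setting above.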

📈 Key Metrics Explained

Composition %

Each category's share of the cluster's total quantity × value, expressed as a percentage. The shares sum to approximately 100% per cluster.

Presence %

The percentage of baskets in a cluster that contain at least one item from a category.
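
The two definitions above can be sketched with pandas as follows. The matrix values, basket IDs, and cluster labels are invented, and composition is assumed to be computed from cluster-level totals rather than by averaging per-basket shares; the analysis script may do it differently.

```python
import pandas as pd

# Hypothetical inputs: raw quantity x value per basket and category, plus cluster labels
basket_matrix = pd.DataFrame(
    {"A": [4.0, 0.0, 2.0, 0.0], "B": [0.0, 6.0, 2.0, 8.0]},
    index=pd.Index([101, 102, 103, 104], name="Basket_ID"),
)
clusters = pd.Series([0, 0, 1, 1], index=basket_matrix.index, name="cluster")

# Composition %: each category's share of the cluster's total quantity x value
cluster_totals = basket_matrix.groupby(clusters).sum()
composition = cluster_totals.div(cluster_totals.sum(axis=1), axis=0) * 100

# Presence %: share of baskets in the cluster with a non-zero value for the category
presence = (basket_matrix > 0).groupby(clusters).mean() * 100
```

In cluster 1 of this toy data, category B appears in every basket (presence 100%) and accounts for roughly 83% of the composition, while category A appears in half the baskets but contributes only about 17%. The two metrics can therefore rank categories differently.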

Lift

The ratio of a category's composition in a cluster to that category's global average composition:

  • Lift > 1 = Overrepresented in this cluster
  • Lift = 1 = Average representation
  • Lift < 1 = Underrepresented in this cluster
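
A minimal, dependency-free sketch of the lift calculation is shown below. The function name and the percentages are hypothetical, and the "global average" is assumed to be the category's composition over all baskets.

```python
def lift(cluster_composition, global_composition):
    """Lift per category: cluster share divided by global share (same units for both)."""
    return {
        cat: cluster_composition.get(cat, 0.0) / share
        for cat, share in global_composition.items()
        if share > 0  # skip categories that never occur globally
    }

# Made-up composition percentages for one cluster and for all baskets combined
cluster_0 = {"A": 40.0, "B": 60.0}
overall = {"A": 30.0, "B": 70.0}
lifts = lift(cluster_0, overall)
```

Here category A has a lift above 1 (overrepresented in the cluster) and category B has a lift below 1 (underrepresented).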

🛠️ Troubleshooting

"File not found" error:

  • Ensure POS_DATA_BAPR_2024-2025_updated.xlsx is in the data/ folder
  • Check file name spelling and extension

"Column not found" error:

  • Verify the Excel sheet contains required columns: Basket_ID, Barcode, Quantity, Value_, Custom Category
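
One way to check this before running the pipeline is a small helper like the one below. It is a hypothetical utility, not part of the repository, and it works on any iterable of column names.

```python
REQUIRED_COLUMNS = ["Basket_ID", "Barcode", "Quantity", "Value_", "Custom Category"]

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]

# Example with a header row that lacks two of the required columns
header = ["Basket_ID", "Barcode", "Quantity"]
missing = missing_columns(header)
print(missing)  # ['Value_', 'Custom Category']
```

With pandas, the header could be read cheaply via `pd.read_excel(path, nrows=0).columns` and passed to the helper, where `path` points at the input workbook.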

Path errors:

  • Scripts use automatic path detection. Run from the project root: python clustering.py

📝 Notes

  • All paths are automatically resolved relative to the project structure
  • Outputs are overwritten on each run (except README files)
  • The clustering process may take several minutes depending on data size
  • Matplotlib plots are generated but not displayed in batch mode

Project Version: 1.0
Last Updated: November 2025
