- Overview
- Project Structure
- Prerequisites
- Installation
- Usage
- Pipeline Overview
- Output Files
- Configuration
This project performs basket segmentation and clustering analysis on POS (Point of Sale) data. It:
- Processes raw POS data and creates custom product categories
- Performs K-Means clustering based on quantity × value metrics
- Analyzes cluster profiles and generates comprehensive visualizations and reports
The pipeline is fully automated and generates insights about customer purchasing patterns, basket composition, and category preferences across different segments.
Analytics/
├── README.md # This file
├── clustering.py # Main entry point
│
├── data/ # Input data folder
│ ├── README.md # Data folder documentation
│ ├── POS_DATA_BAPR_2024-2025_updated.xlsx # ⭐ Required input file
│ ├── FINAL_POS_Data.xlsx # Processed data (auto-generated)
│ ├── FINAL_POS_Data_For_Input.xlsx # With custom categories (auto-generated)
│ └── [other CSV/Excel files] # Category and reference data
│
├── scripts/ # Python scripts
│ ├── add_second_custom_category.py # Creates second custom categories
│ ├── kmeans_Quantity_Value.py # K-Means clustering algorithm
│ ├── analysis_presence.py # Cluster profile analysis
│ └── [other analysis scripts]
│
├── outputs/ # Analysis results
│ ├── README.md # Outputs documentation
│ ├── *.csv # Data tables (top categories, lift, etc.)
│ └── *.png # Visualizations (heatmaps, bar charts)
│
├── outputs/ # Alternative output version
└── custom_category_analysis/ # Custom category analysis
- Python 3.8+
- Required Libraries:
pandas numpy scikit-learn matplotlib openpyxl
Install dependencies:
pip install pandas numpy scikit-learn matplotlib openpyxl
- Place the input file in the data/ folder: POS_DATA_BAPR_2024-2025_updated.xlsx
- Run the clustering pipeline:
  python clustering.py
- Check the results in the outputs/ folder
You can also run individual steps:
# Step 1: Create second-level categories
python scripts/add_second_custom_category.py
# Step 2: Perform K-Means clustering
python scripts/kmeans_Quantity_Value.py
# Step 3: Generate analysis and visualizations
python scripts/analysis_presence.py
Step 1: Create second-level categories (scripts/add_second_custom_category.py)
- Input: POS_DATA_BAPR_2024-2025_updated.xlsx
- Process: Maps first-level categories to second-level groupings (e.g., "ΠΡΩΙΝΟ" (breakfast) + "ΑΥΓΑ" (eggs) → "ΠΡΩΙΝΟ - ΑΥΓΑ")
- Output: FINAL_POS_Data_For_Input.xlsx
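The category mapping above can be sketched with pandas. This is only an illustration of the "FIRST - SUB" labelling shown in the example. The function name, the column names passed to it, and the fallback for rows without a sub-category are assumptions, not the actual logic of scripts/add_second_custom_category.py.

```python
import pandas as pd


def add_second_category(df, first_col, sub_col, out_col="Second Custom Category"):
    """Return a copy of df with a combined 'FIRST - SUB' label column.

    first_col, sub_col and out_col are hypothetical column names chosen
    for this sketch; the real script may use different ones.
    """
    out = df.copy()
    # Normalise both labels to stripped strings; missing values become "".
    first = out[first_col].fillna("").astype(str).str.strip()
    sub = out[sub_col].fillna("").astype(str).str.strip()
    has_sub = sub != ""
    # Keep the first-level label alone when there is no sub-category (assumed fallback).
    out[out_col] = first.where(~has_sub, first + " - " + sub)
    return out


# Example using the labels from the text above.
example = pd.DataFrame({"level_1": ["ΠΡΩΙΝΟ"], "level_2": ["ΑΥΓΑ"]})
labelled = add_second_category(example, "level_1", "level_2")
```

The function returns a copy, so the input frame is left unchanged and the step can be re-run safely.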
Step 2: Perform K-Means clustering (scripts/kmeans_Quantity_Value.py)
- Input: FINAL_POS_Data_For_Input.xlsx
- Metrics: Quantity × Value per category per basket
- Features: "qv_share" (each category's share of a basket's total quantity × value; shares sum to 1 per basket)
- Clustering: Automatic K selection (2-10 clusters) via silhouette score when FORCE_K is None
- Output: basket_segments.xlsx with:
  - basket_matrix sheet (basket × category values)
  - basket_clusters sheet (basket → cluster assignment)
  - cluster_profiles sheet (cluster × category composition %)
  - k_search sheet (K evaluation metrics)
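The feature construction and K search described above might look roughly like the following. It is a minimal sketch under stated assumptions: the function names are hypothetical, baskets with zero total value are dropped, and n_init=10 is an arbitrary choice. The column names and the 2-10 search range come from this README.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def build_qv_share(df, basket_col="Basket_ID", cat_col="Custom Category",
                   qty_col="Quantity", value_col="Value_"):
    """Basket x category matrix of quantity*value, normalised so each row sums to 1."""
    qv = df[qty_col] * df[value_col]
    matrix = df.assign(qv=qv).pivot_table(
        index=basket_col, columns=cat_col, values="qv",
        aggfunc="sum", fill_value=0.0,
    )
    # Assumption: baskets with no positive total cannot be normalised and are dropped.
    matrix = matrix[matrix.sum(axis=1) > 0]
    return matrix.div(matrix.sum(axis=1), axis=0)


def choose_k(X, k_min=2, k_max=10, random_state=42):
    """Return (best_k, {k: silhouette score}) for k in k_min..k_max."""
    scores = {}
    upper = min(k_max, len(X) - 1)  # silhouette needs at most n_samples - 1 clusters
    for k in range(k_min, upper + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

A fixed number of clusters (the FORCE_K setting) would simply skip choose_k and fit a single KMeans model with that value.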
Step 3: Generate analysis and visualizations (scripts/analysis_presence.py)
- Calculates:
  - Top 10 categories per cluster (by composition %)
  - Presence percentages (% of baskets containing each category)
  - Lift values (discriminative power)
- Generates: CSV reports and visualizations (heatmaps, bar charts)
- Output: Multiple files in the outputs/ folder
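As a rough illustration of the calculations listed above, the sketch below derives composition and presence percentages from a basket × category matrix and a cluster label per basket. The function names are hypothetical, and it assumes composition is computed from summed raw quantity × value; the actual script may aggregate differently.

```python
import pandas as pd


def composition_pct(basket_matrix, labels):
    """Share of each cluster's total value contributed by each category (rows sum to 100)."""
    totals = basket_matrix.groupby(labels).sum()
    return totals.div(totals.sum(axis=1), axis=0) * 100


def presence_pct(basket_matrix, labels):
    """Percentage of baskets in each cluster with a non-zero value for each category."""
    present = (basket_matrix > 0).astype(float)
    return present.groupby(labels).mean() * 100


def top_categories(profile, n=10):
    """Map each cluster id to its n largest categories in the given profile table."""
    return {cluster: row.nlargest(n) for cluster, row in profile.iterrows()}
```

Here labels is assumed to be a pandas Series indexed by the same basket ids as basket_matrix, so the grouping aligns on the index.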
CSV tables:
- top_categories_per_cluster.csv: Top 10 categories by value for each cluster
- discriminative_categories_per_cluster.csv: Top 10 discriminative categories (with lift scores)
- presence_percentages_per_cluster.csv: % of baskets containing each category per cluster
- long_table_all_values.csv: Long-format table of all values
Visualizations:
- heatmap_values.png: Heatmap of all cluster × category values
- bar_topN_cluster_<id>.png: Top 10 categories by composition for each cluster
- bar_topN_presence_cluster_<id>.png: Top 10 categories by basket presence
- bar_topN_lift_cluster_<id>.png: Top 10 discriminative categories (by lift)
- comparison_cluster_<id>.png: Composition vs. presence comparison
Key parameters can be adjusted in scripts/kmeans_Quantity_Value.py:
# Number of clusters (set to None for auto-search)
FORCE_K = 9
# Feature calculation method
FEATURE_SET = "qv_share" # or "qv_sum"
# Value metric to use
VALUE_COL = "Value_" # Can be "Price", "Average_Price", "Price_Adjusted"
# K-Means search range (for auto-search)
K_MIN, K_MAX = 2, 10
# Random seed for reproducibility
RANDOM_STATE = 42
Composition %: The percentage of a cluster's total quantity×value that comes from each category. Sums to ≈ 100% per cluster.
Presence %: The percentage of baskets in a cluster that contain at least one item from a category.
Lift: Ratio of a category's composition in a cluster to its global average composition:
- Lift > 1 = Overrepresented in this cluster
- Lift = 1 = Average representation
- Lift < 1 = Underrepresented in this cluster
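The lift definition above can be illustrated with a short pandas sketch. It assumes the "global average" is the category's share of total quantity × value across all baskets; the function name is hypothetical and the real script may define the baseline differently.

```python
import pandas as pd


def lift_table(basket_matrix, labels):
    """Cluster x category lift: cluster composition share / global composition share."""
    cluster_totals = basket_matrix.groupby(labels).sum()
    cluster_share = cluster_totals.div(cluster_totals.sum(axis=1), axis=0)
    category_totals = basket_matrix.sum()
    # Assumed baseline: each category's share of the grand total over all baskets.
    global_share = category_totals / category_totals.sum()
    # Categories with a zero global share would produce NaN or inf here.
    return cluster_share.div(global_share, axis=1)
```

For example, a category that makes up half of a cluster's value but only a third of the overall value has a lift of 1.5 in that cluster.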
"File not found" error:
- Ensure POS_DATA_BAPR_2024-2025_updated.xlsx is in the data/ folder
- Check file name spelling and extension
"Column not found" error:
- Verify the Excel sheet contains the required columns: Basket_ID, Barcode, Quantity, Value_, Custom Category
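A quick way to diagnose this error before running the pipeline is to compare the sheet's header against the required column names. The helper below is a hypothetical sketch, not part of the project scripts. It also reports near misses that differ only by case or surrounding whitespace, which are a common cause of this error.

```python
REQUIRED_COLUMNS = ["Basket_ID", "Barcode", "Quantity", "Value_", "Custom Category"]


def check_columns(columns, required=REQUIRED_COLUMNS):
    """Return (missing, hints) for an iterable of column names.

    missing: required names with no exact match.
    hints:   missing name -> existing column that matches once case and
             surrounding whitespace are ignored.
    """
    columns = [str(c) for c in columns]
    missing = [name for name in required if name not in columns]
    normalised = {c.strip().lower(): c for c in columns}
    hints = {name: normalised[name.lower()]
             for name in missing if name.lower() in normalised}
    return missing, hints
```

With pandas, the header alone can be read via pd.read_excel(path, nrows=0).columns and passed to check_columns.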
Path errors:
- Scripts use automatic path detection. Run from the project root:
python clustering.py
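For reference, automatic path detection of this kind is often done with pathlib by walking up from a script's location until a known marker file is found. The sketch below is an assumption about how that could work here, using clustering.py as the marker. It is not the project's actual implementation.

```python
from pathlib import Path


def find_project_root(start, marker="clustering.py"):
    """Walk upwards from start until a directory containing marker is found.

    The marker file name is an assumption made for this sketch.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} above {start}")


def project_paths(start):
    """Return the data/ and outputs/ directories relative to the detected root."""
    root = find_project_root(start)
    return {"root": root, "data": root / "data", "outputs": root / "outputs"}
```

Inside a script, project_paths(__file__) would give the same directories regardless of the current working directory.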
- All paths are automatically resolved relative to the project structure
- Outputs are overwritten on each run (except README files)
- The clustering process may take several minutes depending on data size
- Matplotlib plots are generated but not displayed in batch mode
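Saving plots without displaying them is typically achieved by selecting a non-interactive Matplotlib backend. The sketch below assumes the "Agg" backend and a hypothetical save_heatmap helper; the project scripts may handle this differently.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend: render to files, never open a window

import matplotlib.pyplot as plt
import numpy as np


def save_heatmap(values, row_labels, col_labels, path):
    """Write a cluster x category heatmap to path as an image file."""
    fig, ax = plt.subplots(figsize=(6, 4))
    image = ax.imshow(np.asarray(values), aspect="auto", cmap="viridis")
    ax.set_xticks(range(len(col_labels)))
    ax.set_xticklabels(col_labels, rotation=45, ha="right")
    ax.set_yticks(range(len(row_labels)))
    ax.set_yticklabels(row_labels)
    fig.colorbar(image, ax=ax)
    fig.tight_layout()
    fig.savefig(path, dpi=150)
    plt.close(fig)  # release the figure so long batch runs do not accumulate memory
```

Closing each figure after saving matters when one plot is generated per cluster, since open figures otherwise stay in memory for the whole run.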
Project Version: 1.0
Last Updated: November 2025