Business Analytics and Personalization Technologies

📋 Table of Contents

Overview
Project Structure
Prerequisites
Installation
Usage
Pipeline Overview
Output Files
Configuration

🎯 Overview

This project performs basket segmentation and clustering analysis on POS (Point of Sale) data. It:

- Processes raw POS data and creates custom product categories
- Performs K-Means clustering based on quantity × value metrics
- Analyzes cluster profiles and generates comprehensive visualizations and reports

The pipeline is fully automated and generates insights about customer purchasing patterns, basket composition, and category preferences across different segments.

📁 Project Structure

Analytics/
├── README.md                          # This file
├── clustering.py                      # Main entry point
│
├── data/                              # Input data folder
│   ├── README.md                      # Data folder documentation
│   ├── POS_DATA_BAPR_2024-2025_updated.xlsx  # ⭐ Required input file
│   ├── FINAL_POS_Data.xlsx            # Processed data (auto-generated)
│   ├── FINAL_POS_Data_For_Input.xlsx  # With custom categories (auto-generated)
│   └── [other CSV/Excel files]        # Category and reference data
│
├── scripts/                           # Python scripts
│   ├── add_second_custom_category.py  # Creates second-level custom categories
│   ├── kmeans_Quantity_Value.py       # K-Means clustering algorithm
│   ├── analysis_presence.py           # Cluster profile analysis
│   └── [other analysis scripts]
│
├── outputs/                           # Analysis results
│   ├── README.md                      # Outputs documentation
│   ├── *.csv                          # Data tables (top categories, lift, etc.)
│   └── *.png                          # Visualizations (heatmaps, bar charts)
│
├── outputs/                           # Alternative output version
└── custom_category_analysis/          # Custom category analysis

✅ Prerequisites

Python 3.8+

Required Libraries:

pandas
numpy
scikit-learn
matplotlib
openpyxl

Installation

Install dependencies:

pip install pandas numpy scikit-learn matplotlib openpyxl

🚀 Usage

Quick Start

Place the input file in the data/ folder:
```
POS_DATA_BAPR_2024-2025_updated.xlsx
```
Run the clustering pipeline:
```
python clustering.py
```
Check the results in the outputs/ folder

Step-by-Step Execution

You can also run individual steps:

# Step 1: Create second-level categories
python scripts/add_second_custom_category.py

# Step 2: Perform K-Means clustering
python scripts/kmeans_Quantity_Value.py

# Step 3: Generate analysis and visualizations
python scripts/analysis_presence.py
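
The three steps above could also be driven from Python. The following is only a rough sketch: the `run_pipeline` helper and its `runner` parameter are hypothetical, and the real `clustering.py` may orchestrate the steps differently. The script paths are the ones listed above.

```python
import subprocess
import sys
from pathlib import Path

# Script paths as listed in the step-by-step section, in execution order.
STEPS = [
    "scripts/add_second_custom_category.py",
    "scripts/kmeans_Quantity_Value.py",
    "scripts/analysis_presence.py",
]


def run_pipeline(root=".", runner=subprocess.run):
    """Run each step with the current interpreter, stopping at the first failure.

    `runner` is injectable (an assumption made for this sketch) so the step
    order can be inspected without actually executing the scripts.
    """
    root = Path(root)
    for step in STEPS:
        # check=True makes subprocess.run raise CalledProcessError on a non-zero exit.
        runner([sys.executable, str(root / step)], check=True, cwd=str(root))
```

Calling `run_pipeline()` from the project root would run the three scripts in order and abort as soon as one of them fails.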

🔄 Pipeline Overview

Step 1: Custom Category Processing

Input: POS_DATA_BAPR_2024-2025_updated.xlsx
Process: Maps first-level categories to second-level groupings (e.g., "ΠΡΩΙΝΟ" (breakfast) + "ΑΥΓΑ" (eggs) → "ΠΡΩΙΝΟ - ΑΥΓΑ")
Output: FINAL_POS_Data_For_Input.xlsx
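
A minimal sketch of the label-joining idea in this step, assuming the two category levels are available as separate columns. The column names `Category_L1`, `Category_L2` and `Second_Custom_Category`, the helper name, and the sample rows are all hypothetical; the actual script may derive the grouping from a lookup table instead.

```python
import pandas as pd


def add_second_level_category(df, first_col="Category_L1", second_col="Category_L2",
                              out_col="Second_Custom_Category"):
    """Join two category levels into one 'FIRST - SECOND' label per row."""
    out = df.copy()
    # Strip stray whitespace so "ΠΡΩΙΝΟ " and "ΠΡΩΙΝΟ" map to the same label.
    out[out_col] = out[first_col].str.strip() + " - " + out[second_col].str.strip()
    return out


# Made-up sample rows, for illustration only.
rows = pd.DataFrame({
    "Category_L1": ["ΠΡΩΙΝΟ", "ΠΡΩΙΝΟ "],
    "Category_L2": ["ΑΥΓΑ", "ΓΑΛΑ"],
})
labelled = add_second_level_category(rows)
```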

Step 2: K-Means Clustering

Input: FINAL_POS_Data_For_Input.xlsx
Metrics: Quantity × Value per category per basket
Features: "qv_share" (share of quantity×value per basket, sums to 1)
Clustering: K is fixed by FORCE_K when it is set; when FORCE_K is None, K is selected automatically (2-10 clusters) via silhouette score
Output: basket_segments.xlsx with:
- basket_matrix sheet (basket × category values)
- basket_clusters sheet (basket → cluster assignment)
- cluster_profiles sheet (cluster × category composition %)
- k_search sheet (K evaluation metrics)
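
The automatic K search can be sketched as follows, using scikit-learn's `KMeans` and `silhouette_score`. This illustrates the general technique and is not the project's code: the `search_k` helper, the recorded columns, and the synthetic share vectors are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def search_k(features, k_min=2, k_max=10, random_state=42):
    """Fit K-Means for each K in the range and keep the K with the best silhouette."""
    records = []
    # The silhouette score is only defined for 2 <= K <= n_samples - 1.
    upper = min(k_max, len(features) - 1)
    for k in range(k_min, upper + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(features)
        records.append({
            "k": k,
            "silhouette": silhouette_score(features, labels),
            "inertia": model.inertia_,
        })
    table = pd.DataFrame(records)
    best_k = int(table.loc[table["silhouette"].idxmax(), "k"])
    return best_k, table


# Synthetic baskets: three tight groups, each dominated by a different category.
rng = np.random.default_rng(42)
centers = np.array([[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]])
raw = np.clip(np.repeat(centers, 20, axis=0) + rng.normal(0, 0.01, size=(60, 3)), 0, None)
shares = raw / raw.sum(axis=1, keepdims=True)  # rows sum to 1, like qv_share

best_k, k_table = search_k(shares)
```

A table like `k_table` corresponds in spirit to the k_search sheet, although the real sheet may record different metrics.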

Step 3: Cluster Analysis

Calculates:
- Top 10 categories per cluster (by composition %)
- Presence percentages (% of baskets containing each category)
- Lift values (discriminative power)
Generates: CSV reports and visualizations (heatmaps, bar charts)
Output: Multiple files in outputs/ folder
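
As an illustration of the "top 10 per cluster" calculation, the sketch below ranks categories within each cluster from a cluster × category table. The helper name, the output column names, and the sample values are hypothetical.

```python
import pandas as pd


def top_n_per_cluster(cluster_profiles, n=10):
    """Return the n largest categories per cluster as a long-format table."""
    long = (
        cluster_profiles.rename_axis("cluster")
        .reset_index()
        .melt(id_vars="cluster", var_name="category", value_name="composition_pct")
    )
    # Sort within each cluster from largest to smallest share, then keep the first n rows.
    long = long.sort_values(["cluster", "composition_pct"], ascending=[True, False])
    return long.groupby("cluster").head(n).reset_index(drop=True)


# Made-up profile: rows are clusters, columns are categories, values are composition %.
profiles = pd.DataFrame({"A": [90.0, 0.0], "B": [10.0, 60.0], "C": [0.0, 40.0]}, index=[0, 1])
top2 = top_n_per_cluster(profiles, n=2)
```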

📊 Output Files

Data Files (CSV)

top_categories_per_cluster.csv - Top 10 categories by composition % for each cluster
discriminative_categories_per_cluster.csv - Top 10 discriminative categories (with lift scores)
presence_percentages_per_cluster.csv - % of baskets containing each category per cluster
long_table_all_values.csv - Long-format table of all values

Visualizations (PNG)

heatmap_values.png - Heatmap of all cluster × category values
bar_topN_cluster_<id>.png - Top 10 categories by composition for each cluster
bar_topN_presence_cluster_<id>.png - Top 10 categories by basket presence
bar_topN_lift_cluster_<id>.png - Top 10 discriminative categories (by lift)
comparison_cluster_<id>.png - Composition vs Presence comparison

⚙️ Configuration

Key parameters can be adjusted in scripts/kmeans_Quantity_Value.py:

# Number of clusters (set to None for auto-search)
FORCE_K = 9

# Feature calculation method
FEATURE_SET = "qv_share"  # or "qv_sum"

# Value metric to use
VALUE_COL = "Value_"  # Can be "Price", "Average_Price", "Price_Adjusted"

# K-Means search range (for auto-search)
K_MIN, K_MAX = 2, 10

# Random seed for reproducibility
RANDOM_STATE = 42
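
To make the difference between the two `FEATURE_SET` values concrete, here is a hedged sketch of how a basket × category matrix could be built under each setting. The column names come from the Troubleshooting section; the `build_features` helper and the sample rows are assumptions, and the real script may differ in detail.

```python
import pandas as pd


def build_features(lines, feature_set="qv_share", value_col="Value_"):
    """Pivot POS lines into a basket × category matrix of quantity × value."""
    qv = lines["Quantity"] * lines[value_col]
    matrix = lines.assign(qv=qv).pivot_table(
        index="Basket_ID", columns="Custom Category",
        values="qv", aggfunc="sum", fill_value=0.0,
    )
    if feature_set == "qv_sum":
        return matrix  # raw quantity × value totals per basket and category
    if feature_set == "qv_share":
        totals = matrix.sum(axis=1)
        # Divide each row by its total so it sums to 1; empty baskets stay all-zero.
        return matrix.div(totals.where(totals > 0, 1.0), axis=0)
    raise ValueError(f"unknown feature_set: {feature_set!r}")


# Made-up POS lines, for illustration only.
lines = pd.DataFrame({
    "Basket_ID": [1, 1, 2],
    "Custom Category": ["DAIRY", "BAKERY", "DAIRY"],
    "Quantity": [2, 1, 3],
    "Value_": [1.5, 1.0, 2.0],
})
sums = build_features(lines, feature_set="qv_sum")
shares = build_features(lines, feature_set="qv_share")
```

With `qv_share`, large and small baskets become comparable because only the mix of categories matters. With `qv_sum`, basket size also influences the clustering.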

📈 Key Metrics Explained

Composition %

Each category's share of a cluster's total quantity × value, expressed as a percentage. The shares sum to approximately 100% within each cluster.
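
One possible reading of this definition, sketched with pandas. It is an assumption that composition pools quantity × value over all baskets in a cluster; the project might instead average per-basket shares. The helper name and sample data are hypothetical.

```python
import pandas as pd


def composition_pct(basket_matrix, clusters):
    """Share of each cluster's total quantity × value contributed by each category."""
    cluster_totals = basket_matrix.groupby(clusters).sum()
    return cluster_totals.div(cluster_totals.sum(axis=1), axis=0) * 100


# Made-up basket × category values and cluster assignments.
basket_matrix = pd.DataFrame({"A": [3.0, 6.0, 0.0], "B": [1.0, 0.0, 4.0]}, index=[1, 2, 3])
clusters = pd.Series([0, 0, 1], index=[1, 2, 3], name="cluster")
comp = composition_pct(basket_matrix, clusters)
```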

Presence %

The percentage of baskets in a cluster that contain at least one item from a category.
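
A short sketch of the presence calculation, assuming a basket × category matrix in which a value above zero means the basket contains the category. The helper name and sample data are hypothetical.

```python
import pandas as pd


def presence_pct(basket_matrix, clusters):
    """Percentage of baskets in each cluster with a non-zero value per category."""
    present = basket_matrix.gt(0).astype(float)  # 1.0 if the category appears in the basket
    return present.groupby(clusters).mean() * 100


# Made-up basket × category values and cluster assignments.
basket_matrix = pd.DataFrame({"A": [3.0, 6.0, 0.0], "B": [1.0, 0.0, 4.0]}, index=[1, 2, 3])
clusters = pd.Series([0, 0, 1], index=[1, 2, 3], name="cluster")
presence = presence_pct(basket_matrix, clusters)
```

Unlike composition, presence percentages do not sum to 100% across categories, because one basket can contain several categories.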

Lift

The ratio of a category's composition % in a cluster to that category's global average composition %:

Lift > 1 = Overrepresented in this cluster
Lift = 1 = Average representation
Lift < 1 = Underrepresented in this cluster
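
The lift ratio can be sketched as below. This assumes the "global average" is the category's share of quantity × value over all baskets; the project could instead average across clusters. The helper name and sample data are hypothetical.

```python
import pandas as pd


def lift(basket_matrix, clusters):
    """Cluster-level category share divided by the category's global share."""
    cluster_totals = basket_matrix.groupby(clusters).sum()
    cluster_share = cluster_totals.div(cluster_totals.sum(axis=1), axis=0)
    global_totals = basket_matrix.sum()
    global_share = global_totals / global_totals.sum()
    # A category with zero global share would produce NaN or inf here.
    return cluster_share.div(global_share, axis=1)


# Made-up basket × category values and cluster assignments.
basket_matrix = pd.DataFrame({"A": [3.0, 6.0, 0.0], "B": [1.0, 0.0, 4.0]}, index=[1, 2, 3])
clusters = pd.Series([0, 0, 1], index=[1, 2, 3], name="cluster")
lift_table = lift(basket_matrix, clusters)
```

In this toy data, category A makes up 90% of cluster 0 but only 9/14 of all value, so its lift in cluster 0 is 1.4, which marks it as overrepresented there.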

🛠️ Troubleshooting

"File not found" error:

Ensure POS_DATA_BAPR_2024-2025_updated.xlsx is in the data/ folder
Check file name spelling and extension

"Column not found" error:

Verify the Excel sheet contains required columns: Basket_ID, Barcode, Quantity, Value_, Custom Category
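
A quick way to check the columns before running the pipeline is sketched below. The `missing_columns` helper is hypothetical. In practice the DataFrame would come from `pd.read_excel` on the input file, which is replaced here by a small made-up frame.

```python
import pandas as pd

# Required columns as listed above.
REQUIRED_COLUMNS = ["Basket_ID", "Barcode", "Quantity", "Value_", "Custom Category"]


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the DataFrame."""
    # Strip header whitespace, a common cause of spurious "column not found" errors.
    present = {str(col).strip() for col in df.columns}
    return [col for col in required if col not in present]


# Made-up frame standing in for the Excel sheet.
sample = pd.DataFrame(columns=["Basket_ID", "Barcode ", "Quantity"])
missing = missing_columns(sample)
```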

Path errors:

Scripts use automatic path detection. Run from the project root: python clustering.py
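
Automatic path detection of this kind is often implemented with `pathlib`, anchored on the location of a file rather than on the current working directory. The sketch below shows the idea. The `project_paths` helper is hypothetical, and in a real script the anchor would typically be `__file__`.

```python
from pathlib import Path


def project_paths(anchor):
    """Derive project folders from the location of a file in the project root."""
    root = Path(anchor).resolve().parent
    return {
        "root": root,
        "data": root / "data",
        "scripts": root / "scripts",
        "outputs": root / "outputs",
    }


# Example anchor; inside clustering.py this would be project_paths(__file__).
paths = project_paths("Analytics/clustering.py")
input_file = paths["data"] / "POS_DATA_BAPR_2024-2025_updated.xlsx"
```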

📝 Notes

All paths are automatically resolved relative to the project structure
Outputs are overwritten on each run (except README files)
The clustering process may take several minutes depending on data size
Matplotlib plots are saved as PNG files but not displayed on screen in batch mode
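
Saving plots without displaying them is commonly done by selecting Matplotlib's non-interactive `Agg` backend. The following sketch shows that pattern. Whether the project's scripts do exactly this is an assumption, and the `save_bar_chart` helper and sample values are hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to files, never open a window
import matplotlib.pyplot as plt


def save_bar_chart(labels, values, path, title=""):
    """Draw a horizontal bar chart and write it to disk as a PNG."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.barh(labels, values)
    ax.set_xlabel("Composition %")
    ax.set_title(title)
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)  # release the figure; matters when many charts are made in a loop
    return path


# Made-up values written to a temporary folder.
out_dir = tempfile.mkdtemp()
png_path = save_bar_chart(["A", "B"], [90, 10],
                          os.path.join(out_dir, "bar_topN_cluster_0.png"),
                          title="Cluster 0")
```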

Project Version: 1.0
Last Updated: November 2025
