# 01 — Exploratory Data Analysis (EDA)

**Objective:** Understand the structure, distributions, outliers, and feature correlations  
of the e-commerce transaction dataset before clustering.

**Key questions:**
1. What do the distributions of numerical features look like? Are they skewed?
2. Are there extreme outliers that could distort clustering?
3. How many unique values do categorical features have? (cardinality)
4. Are there temporal patterns in order timing?
5. What correlations exist between features?

In [1]:
import sys
sys.path.insert(0, "..")

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import yaml

pd.set_option("display.max_columns", 50)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", "{:.2f}".format)

## 1. Load Data

In [2]:
df = pd.read_parquet("../data/processed/orders_features.parquet")
print(f"Shape: {df.shape[0]} orders x {df.shape[1]} columns")
df.head()

Shape: 9000 orders x 16 columns


| | customer_id | order_id | date_add | payment_name | payment_status | payment_paid | delivery_type | delivery_price | client_city | order_amount_brutto | n_items | avg_item_price | max_item_price | total_quantity | hour_of_day | day_of_week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5482dca302b5 | 791880451 | 2025-11-02 18:38:41 | Allegro Finance | 1 | 211.59 | Allegro One Box, One Kurier | 0.0 | Warszawa | 211.59 | 1 | 211.59 | 211.59 | 1 | 18 | 6 |
| 1 | 69bf7121725c | 791880302 | 2025-11-02 18:38:50 | Allegro Finance | 1 | 25.72 | Allegro Wysyłka z Polski do Słowacji - Automat... | 2.39 | Vikartovce | 23.33 | 1 | 23.33 | 23.33 | 1 | 18 | 6 |
| 2 | ce21e6803616 | 791878538 | 2025-11-02 18:31:24 | Allegro Finance | 1 | 282.2 | Allegro Paczkomaty InPost | 0.0 | Ciecierzyce | 282.2 | 1 | 282.2 | 282.2 | 1 | 18 | 6 |
| 3 | 1d48eb8554c2 | 791858697 | 2025-11-02 17:06:59 | Allegro Finance | 1 | 110.39 | Allegro Paczkomaty InPost | 0.0 | Serby | 110.39 | 1 | 110.39 | 110.39 | 1 | 17 | 6 |
| 4 | 2290b78c20ce | 791843871 | 2025-11-02 15:53:38 | Przelewy24 | 1 | 399.0 | Paczkomaty InPost | 0.0 | Zajączki Drugie, Krzepice | 399.0 | 1 | 399.0 | 399.0 | 1 | 15 | 6 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9000 entries, 0 to 8999
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   customer_id          9000 non-null   object        
 1   order_id             9000 non-null   object        
 2   date_add             9000 non-null   datetime64[us]
 3   payment_name         8879 non-null   object        
 4   payment_status       9000 non-null   int64         
 5   payment_paid         9000 non-null   float64       
 6   delivery_type        8982 non-null   object        
 7   delivery_price       9000 non-null   float64       
 8   client_city          8893 non-null   object        
 9   order_amount_brutto  9000 non-null   float64       
 10  n_items              9000 non-null   int64         
 11  avg_item_price       9000 non-null   float64       
 12  max_item_price       9000 non-null   float64       
 13  total_quantity       9000 non-nul

In [4]:
df.describe()

| | date_add | payment_status | payment_paid | delivery_price | order_amount_brutto | n_items | avg_item_price | max_item_price | total_quantity | hour_of_day | day_of_week |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9000 | 9000.0 | 9000.0 | 9000.0 | 9000.0 | 9000.0 | 9000.0 | 9000.0 | 9000.0 | 9000.0 | 9000.0 |
| mean | 2025-03-18 16:24:58.855889 | 0.88 | 194.79 | 3.45 | 226.68 | 1.1 | 210.55 | 214.41 | 1.08 | 14.48 | 2.83 |
| min | 2024-09-13 09:39:41 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2024-12-04 22:27:19.750000 | 1.0 | 91.8 | 0.0 | 101.26 | 1.0 | 101.96 | 103.56 | 1.0 | 10.0 | 1.0 |
| 50% | 2025-03-12 09:36:21 | 1.0 | 144.05 | 0.0 | 157.49 | 1.0 | 149.62 | 153.74 | 1.0 | 15.0 | 3.0 |
| 75% | 2025-06-13 16:58:09.750000 | 1.0 | 292.27 | 0.0 | 299.0 | 1.0 | 295.2 | 296.1 | 1.0 | 19.0 | 5.0 |
| max | 2025-11-02 18:38:50 | 2.0 | 21220.0 | 1890.0 | 21220.0 | 7.0 | 21220.0 | 21220.0 | 7.0 | 23.0 | 6.0 |
| std | | 0.33 | 316.47 | 38.23 | 523.41 | 0.38 | 486.79 | 492.29 | 0.42 | 5.31 | 2.02 |


In [5]:
# Missing values
missing = df.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

payment_name     121
client_city      107
delivery_type     18
dtype: int64

## 2. Numerical Feature Distributions

Key features: `order_amount_brutto`, `n_items`, `avg_item_price`, `max_item_price`

In [6]:
NUMERICAL = ["order_amount_brutto", "n_items", "avg_item_price", "max_item_price"]

fig = make_subplots(rows=2, cols=2, subplot_titles=NUMERICAL)

for i, col in enumerate(NUMERICAL):
    row, c = divmod(i, 2)
    fig.add_trace(
        go.Histogram(x=df[col], nbinsx=50, name=col, showlegend=False),
        row=row + 1, col=c + 1,
    )

fig.update_layout(
    title="Numerical Feature Distributions (raw)",
    height=600, template="plotly_white",
)
fig.show()

In [7]:
# Skewness analysis
skewness = df[NUMERICAL].skew()
print("Skewness (>1 indicates strong right-skew):")
print(skewness.to_string())

Skewness (>1 indicates strong right-skew):
order_amount_brutto   29.72
n_items                5.19
avg_item_price        32.25
max_item_price        31.80


In [8]:
# Effect of log1p transformation on skewness
LOG_COLS = ["order_amount_brutto", "avg_item_price", "max_item_price"]

fig = make_subplots(rows=1, cols=3, subplot_titles=[f"log1p({c})" for c in LOG_COLS])

for i, col in enumerate(LOG_COLS):
    transformed = np.log1p(df[col])
    fig.add_trace(
        go.Histogram(x=transformed, nbinsx=50, name=col, showlegend=False),
        row=1, col=i + 1,
    )

fig.update_layout(
    title="Distributions After log1p Transform",
    height=350, template="plotly_white",
)
fig.show()

print("\nSkewness after log1p:")
for col in LOG_COLS:
    print(f"  {col}: {np.log1p(df[col]).skew():.3f}")


Skewness after log1p:
  order_amount_brutto: -2.626
  avg_item_price: -2.490
  max_item_price: -2.483
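
The sign flip after `log1p` deserves a second look. `df.describe()` reports a minimum of 0.0 for all three monetary columns, and `log1p(0) = 0` sits far below the bulk of the transformed values, which can create a left tail. The sketch below reproduces that effect on synthetic log-normal data, not on the real orders; the distribution parameters and the 2% share of zero rows are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for order values (assumption: log-normal, median around 150)
amounts = pd.Series(rng.lognormal(mean=5.0, sigma=0.6, size=10_000))

# Assumption: overwrite 2% of rows with zero to mimic zero-value orders
with_zeros = amounts.copy()
with_zeros.iloc[:200] = 0.0

skew_clean = np.log1p(amounts).skew()
skew_zeros = np.log1p(with_zeros).skew()
skew_positive_only = np.log1p(with_zeros[with_zeros > 0]).skew()

print(f"log1p skew, no zeros:       {skew_clean:.3f}")
print(f"log1p skew, 2% zeros:       {skew_zeros:.3f}")
print(f"log1p skew, zeros excluded: {skew_positive_only:.3f}")
```

If the real data behaves the same way, counting rows with `order_amount_brutto == 0` and inspecting them separately would be a reasonable next check.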


## 3. Outlier Detection

Identify extreme values that may create singleton clusters or distort centroids.

In [9]:
# Box plots for outlier visualization
fig = make_subplots(rows=1, cols=4, subplot_titles=NUMERICAL)

for i, col in enumerate(NUMERICAL):
    fig.add_trace(
        go.Box(y=df[col], name=col, showlegend=False),
        row=1, col=i + 1,
    )

fig.update_layout(
    title="Box Plots — Outlier Overview",
    height=400, template="plotly_white",
)
fig.show()

In [10]:
# IQR-based outlier counts
print("Outlier counts (values > Q3 + 1.5*IQR):")
for col in NUMERICAL:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    upper = q3 + 1.5 * iqr
    n_outliers = (df[col] > upper).sum()
    pct = 100 * n_outliers / len(df)
    print(f"  {col}: {n_outliers} ({pct:.1f}%) — threshold: {upper:.2f}")

Outlier counts (values > Q3 + 1.5*IQR):
  order_amount_brutto: 258 (2.9%) — threshold: 595.61
  n_items: 718 (8.0%) — threshold: 1.00
  avg_item_price: 82 (0.9%) — threshold: 585.06
  max_item_price: 82 (0.9%) — threshold: 584.91
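
The `n_items` row is a degenerate case: Q1 and Q3 are both 1, so the IQR is zero, the fence collapses onto Q3, and every multi-item order is flagged. A minimal sketch of a helper that reports this situation follows; `iqr_upper_fence` is a hypothetical name and the toy series only imitates the shape of `n_items`.

```python
import pandas as pd


def iqr_upper_fence(series: pd.Series, k: float = 1.5) -> tuple[float, bool]:
    """Return (upper fence, degenerate flag); degenerate means IQR == 0."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return float(q3 + k * iqr), bool(iqr == 0)


# Toy data shaped like n_items: mostly single-item orders, a few larger baskets
n_items = pd.Series([1] * 92 + [2] * 6 + [3] * 2)

upper, degenerate = iqr_upper_fence(n_items)
flagged = int((n_items > upper).sum())
print(f"upper fence={upper:.2f}, degenerate={degenerate}, flagged={flagged}")
```

When the flag is set, the IQR rule says little about outliers for that column, and a percentile-based view is usually more informative.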


In [11]:
# Top 10 most extreme orders by order_amount_brutto
top_orders = df.nlargest(10, "order_amount_brutto")[
    ["order_id", "order_amount_brutto", "n_items", "avg_item_price",
     "max_item_price", "delivery_type", "payment_name"]
]
top_orders

| | order_id | order_amount_brutto | n_items | avg_item_price | max_item_price | delivery_type | payment_name |
|---|---|---|---|---|---|---|---|
| 3649 | 750069595 | 21220.0 | 1 | 21220.0 | 21220.0 | Allegro Kurier DPD Węgry | PayU |
| 3704 | 749707040 | 20365.0 | 1 | 20365.0 | 20365.0 | Allegro Kurier DPD Węgry | Przelewy24 |
| 1038 | 781119969 | 20170.0 | 1 | 20170.0 | 20170.0 | Allegro Wysyłka z Polski do Węgier - Automaty ... | Płatność przy odbiorze |
| 258 | 788341660 | 17885.0 | 2 | 8942.5 | 11250.0 | Allegro Kurier DHL Węgry pobranie | Płatność przy odbiorze |
| 6687 | 720991167 | 13710.0 | 1 | 13710.0 | 13710.0 | Allegro Kurier DPD Węgry | PayU |
| 5993 | 729390418 | 11980.0 | 1 | 11980.0 | 11980.0 | Allegro Kurier DPD Węgry pobranie | Płatność przy odbiorze |
| 174 | 789559603 | 11410.0 | 1 | 11410.0 | 11410.0 | Allegro Kurier DPD Węgry pobranie | Płatność przy odbiorze |
| 5749 | 732679555 | 10865.0 | 1 | 10865.0 | 10865.0 | Allegro Kurier UPS Węgry pobranie | Płatność przy odbiorze |
| 1216 | 779012145 | 8645.0 | 1 | 8645.0 | 8645.0 | Allegro Kurier DPD Węgry pobranie | Płatność przy odbiorze |
| 273 | 788187515 | 4187.0 | 2 | 2093.5 | 2170.0 | Allegro International Automaty Paczkowe Czechy... | PayU |


In [12]:
# Percentile analysis — where does the distribution concentrate?
percentiles = [0.50, 0.75, 0.90, 0.95, 0.99, 1.00]
pct_df = df[NUMERICAL].quantile(percentiles)
pct_df.index = [f"P{int(p*100)}" for p in percentiles]
pct_df

| | order_amount_brutto | n_items | avg_item_price | max_item_price |
|---|---|---|---|---|
| P50 | 157.49 | 1.0 | 149.62 | 153.74 |
| P75 | 299.0 | 1.0 | 295.2 | 296.1 |
| P90 | 369.0 | 1.0 | 345.45 | 349.0 |
| P95 | 400.03 | 2.0 | 375.13 | 395.13 |
| P99 | 884.41 | 3.0 | 531.09 | 531.19 |
| P100 | 21220.0 | 7.0 | 21220.0 | 21220.0 |
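
The jump from P99 to P100 suggests that a percentile cap would leave almost all orders untouched while pulling in the extreme tail. One way this could look is sketched below; `cap_at_quantile` is a hypothetical helper, the 0.99 cut-off is an assumption rather than a tuned value, and the toy series is invented for the demonstration.

```python
import pandas as pd


def cap_at_quantile(series: pd.Series, q: float = 0.99) -> pd.Series:
    """Clip values above the q-th quantile (upper-tail winsorization)."""
    return series.clip(upper=series.quantile(q))


# Toy data: a tight bulk plus two large values
toy = pd.Series([100.0] * 98 + [900.0, 21220.0])
capped = cap_at_quantile(toy, q=0.99)

print(f"max before: {toy.max():.2f}, max after: {capped.max():.2f}")
```

Capping keeps every row in the dataset, which matters if each order must still receive a cluster label; dropping the rows instead would be the alternative to weigh.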


## 4. Categorical Feature Analysis

One-hot encoding a high-cardinality column inflates the feature space. This section counts the unique  
values of `delivery_type` and `payment_name` and measures how many orders fall into rare categories.

In [13]:
CATEGORICAL = ["delivery_type", "payment_name"]

for col in CATEGORICAL:
    n_unique = df[col].nunique()
    print(f"\n{col}: {n_unique} unique values")
    print(f"  Top 10:")
    print(df[col].value_counts().head(10).to_string())


delivery_type: 89 unique values
  Top 10:
delivery_type
Allegro Paczkomaty InPost                      4215
Paczkomaty InPost                              1204
Kurier DPD                                      509
Allegro One Box, One Kurier                     436
InPost Paczkomaty 24/7 - wszystkie rozmiary     302
DPD                                             294
Allegro Automat ORLEN Paczka                    251
Kurier DPD pobranie                             196
ORLEN Paczka wszystkie rozmiary                 164
Allegro Kurier DPD                              140

payment_name: 29 unique values
  Top 10:
payment_name
Przelewy24                          4871
PayU                                1645
IdoPay                               593
Allegro Finance                      515
Płatność przy odbiorze               269
Przelewy24 - Allegro Finance         197
Check / Money order                  181
Płatność kurierowi przy odbiorze     164
pobranie                             134

In [14]:
# Cumulative frequency — what % of orders do the top N categories cover?
fig = make_subplots(rows=1, cols=2, subplot_titles=CATEGORICAL)

for i, col in enumerate(CATEGORICAL):
    counts = df[col].value_counts().sort_values(ascending=False)
    cum_pct = 100 * counts.cumsum() / counts.sum()
    fig.add_trace(
        go.Scatter(
            x=list(range(1, len(cum_pct) + 1)),
            y=cum_pct.values,
            mode="lines+markers",
            name=col,
            showlegend=False,
        ),
        row=1, col=i + 1,
    )
    fig.add_hline(y=95, line_dash="dash", line_color="red", row=1, col=i + 1)
    fig.update_xaxes(title_text="# of categories", row=1, col=i + 1)
    fig.update_yaxes(title_text="Cumulative %", row=1, col=i + 1)

fig.update_layout(
    title="Cumulative Frequency by Category Count (red = 95%)",
    height=400, template="plotly_white",
)
fig.show()

for col in CATEGORICAL:
    counts = df[col].value_counts()
    cum_pct = 100 * counts.cumsum() / counts.sum()
    n_for_95 = (cum_pct < 95).sum() + 1  # categories needed to reach at least 95%
    print(f"{col}: top {n_for_95} categories cover 95% of orders (out of {len(counts)} total)")

delivery_type: top 22 categories cover 95% of orders (out of 89 total)
payment_name: top 9 categories cover 95% of orders (out of 29 total)


In [15]:
# Categories with fewer than N occurrences
for threshold in [10, 50, 100]:
    print(f"\nCategories with < {threshold} occurrences:")
    for col in CATEGORICAL:
        counts = df[col].value_counts()
        rare = (counts < threshold).sum()
        rare_pct = 100 * df[df[col].isin(counts[counts < threshold].index)].shape[0] / len(df)
        print(f"  {col}: {rare} categories ({rare_pct:.1f}% of orders)")


Categories with < 10 occurrences:
  delivery_type: 55 categories (2.1% of orders)
  payment_name: 14 categories (0.2% of orders)

Categories with < 50 occurrences:
  delivery_type: 70 categories (6.3% of orders)
  payment_name: 17 categories (1.1% of orders)

Categories with < 100 occurrences:
  delivery_type: 77 categories (11.3% of orders)
  payment_name: 20 categories (3.4% of orders)
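
The counts above suggest collapsing the long tail of categories before one-hot encoding. A minimal sketch of such a grouping step follows; `group_rare`, the `"Other"` label, and the threshold of 50 are illustrative assumptions, and the toy labels are not taken from the dataset.

```python
import pandas as pd


def group_rare(series: pd.Series, min_count: int = 50, other: str = "Other") -> pd.Series:
    """Replace categories seen fewer than min_count times with a single label."""
    counts = series.value_counts()  # missing values are not counted
    rare = counts[counts < min_count].index
    # Keep values that are not rare; missing values stay missing
    return series.where(~series.isin(rare), other)


# Toy column: two frequent categories, two rare ones, one missing value
toy = pd.Series(["InPost"] * 60 + ["DPD"] * 55 + ["UPS"] * 3 + ["DHL"] * 1 + [None])
grouped = group_rare(toy, min_count=50)

print(grouped.value_counts(dropna=False).to_string())
```

Missing values are left untouched on purpose so that the choice between imputing them and giving them their own category stays a separate decision.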


## 5. Temporal Patterns

Order timing features: `hour_of_day`, `day_of_week`

In [16]:
fig = make_subplots(rows=1, cols=2, subplot_titles=["Orders by Hour", "Orders by Day of Week"])

hour_counts = df["hour_of_day"].value_counts().sort_index()
fig.add_trace(
    go.Bar(x=hour_counts.index, y=hour_counts.values, name="Hour", showlegend=False),
    row=1, col=1,
)

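day_names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
# Reindex to all seven weekdays (0 = Monday) so counts stay aligned with day_names
day_counts = df["day_of_week"].value_counts().reindex(range(7), fill_value=0)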
fig.add_trace(
    go.Bar(x=day_names, y=day_counts.values, name="Day", showlegend=False),
    row=1, col=2,
)

fig.update_layout(
    title="Temporal Distribution of Orders",
    height=400, template="plotly_white",
)
fig.show()

In [17]:
# Average order value by hour and day
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=["Avg Order Value by Hour", "Avg Order Value by Day"],
)

avg_by_hour = df.groupby("hour_of_day")["order_amount_brutto"].mean()
fig.add_trace(
    go.Scatter(
        x=avg_by_hour.index, y=avg_by_hour.values,
        mode="lines+markers", name="Hour", showlegend=False,
    ),
    row=1, col=1,
)

# Reindex to all seven weekdays so the bars stay aligned with day_names
avg_by_day = df.groupby("day_of_week")["order_amount_brutto"].mean().reindex(range(7))
fig.add_trace(
    go.Bar(x=day_names, y=avg_by_day.values, name="Day", showlegend=False),
    row=1, col=2,
)

fig.update_layout(
    title="Order Value by Time Dimensions",
    height=400, template="plotly_white",
)
fig.show()
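
Because `hour_of_day` and `day_of_week` wrap around, treating them as plain integers would place hour 23 far from hour 0. The sketch below shows a generic sin/cos encoding that avoids this; it is not the implementation inside the project's feature pipeline, which is not shown in this notebook, and `cyclical_encode` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def cyclical_encode(values: pd.Series, period: int) -> pd.DataFrame:
    """Map a periodic integer feature onto the unit circle."""
    angle = 2 * np.pi * values / period
    return pd.DataFrame({
        f"{values.name}_sin": np.sin(angle),
        f"{values.name}_cos": np.cos(angle),
    })


hours = pd.Series([0, 12, 23], name="hour_of_day")
encoded = cyclical_encode(hours, period=24)
points = encoded.to_numpy()

# Euclidean distances in the encoded space
dist_23_to_0 = np.linalg.norm(points[2] - points[0])
dist_12_to_0 = np.linalg.norm(points[1] - points[0])
print(f"23h to 0h: {dist_23_to_0:.3f}, 12h to 0h: {dist_12_to_0:.3f}")
```

In the encoded space, 23:00 and 00:00 end up close together while noon sits on the opposite side of the circle, which is the property a distance-based clustering algorithm needs.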

## 6. Feature Correlations

In [18]:
from src.visualization import plot_correlation_heatmap

fig = plot_correlation_heatmap(df, NUMERICAL)
fig.show()

In [19]:
# Pairwise scatter matrix of numerical features
fig = px.scatter_matrix(
    df[NUMERICAL],
    dimensions=NUMERICAL,
    title="Pairwise Scatter Matrix",
    opacity=0.3,
    height=700,
    width=700,
)
fig.update_traces(diagonal_visible=False, marker=dict(size=2))
fig.update_layout(template="plotly_white")
fig.show()
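
With features this skewed, a handful of extreme orders can dominate a Pearson coefficient. Whether `plot_correlation_heatmap` uses Pearson is not visible here, so the following is only a cross-check idea: compare Pearson with a rank-based (Spearman) coefficient. The data is synthetic, with two independent columns that share one extreme row.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two independent, right-skewed columns (synthetic, for illustration only)
toy = pd.DataFrame({
    "a": rng.lognormal(mean=5.0, sigma=0.6, size=1_000),
    "b": rng.lognormal(mean=5.0, sigma=0.6, size=1_000),
})
# One extreme row that is large in both columns
toy.iloc[0] = [20_000.0, 20_000.0]

pearson = toy.corr().loc["a", "b"]
# Spearman correlation is Pearson correlation computed on ranks
spearman = toy.rank().corr().loc["a", "b"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two coefficients on the real columns would indicate that the correlation is driven by the tail rather than by the bulk of orders.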

## 7. Feature Space Dimensionality

Examine the full feature matrix produced by the pipeline and how PCA captures variance.

In [20]:
from src.feature_engineering import FeaturePipeline
from sklearn.decomposition import PCA

with open("../config.yaml") as f:
    config = yaml.safe_load(f)

pipeline = FeaturePipeline.from_config(config["features"])
X_scaled = pipeline.fit_transform(df)

print(f"Feature matrix shape: {X_scaled.shape}")
print(f"  Numerical:  {len(pipeline.numerical_cols)}")
print(f"  Cyclical:   {len(pipeline.cyclical_config)} x 2 = {len(pipeline.cyclical_config) * 2}")
print(f"  Categorical one-hot: {X_scaled.shape[1] - len(pipeline.numerical_cols) - len(pipeline.cyclical_config) * 2}")

Feature matrix shape: (9000, 126)
  Numerical:  4
  Cyclical:   2 x 2 = 4
  Categorical one-hot: 118


In [21]:
# PCA explained variance — how many components needed for X% variance?
pca_full = PCA().fit(X_scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_) * 100

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=list(range(1, len(cum_var) + 1)),
    y=cum_var,
    mode="lines",
    name="Cumulative variance",
))

for threshold in [50, 80, 90, 95]:
    n_comp = int(np.searchsorted(cum_var, threshold) + 1)
    fig.add_hline(y=threshold, line_dash="dot", line_color="grey")
    fig.add_annotation(x=n_comp, y=threshold, text=f"{threshold}% → {n_comp} PCs")

fig.update_layout(
    title="PCA Cumulative Explained Variance (full feature set)",
    xaxis_title="Number of Components",
    yaxis_title="Cumulative Variance (%)",
    height=450, template="plotly_white",
)
fig.show()

for pct in [50, 80, 90, 95]:
    n = int(np.searchsorted(cum_var, pct) + 1)
    print(f"  {pct}% variance: {n} components")

  50% variance: 46 components
  80% variance: 84 components
  90% variance: 97 components
  95% variance: 104 components
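
The component counts above come from a manual `searchsorted` over the cumulative variance. scikit-learn can make the same selection directly when `n_components` is a float between 0 and 1. The sketch below compares both approaches on synthetic data with three latent factors; the data and the 95% target are assumptions for the demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic matrix: 3 latent factors spread over 20 columns, plus small noise
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(500, 20))

# Manual selection, as in the cell above
cum_var = np.cumsum(PCA().fit(X).explained_variance_ratio_) * 100
n_manual = int(np.searchsorted(cum_var, 95) + 1)

# Built-in selection: keep enough components to explain 95% of variance
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)

print(f"manual: {n_manual} components, built-in: {pca_95.n_components_} components")
```

The built-in form is convenient inside a pipeline, while the manual curve remains useful for plotting several thresholds at once.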


## 8. Key Findings Summary

| Finding | Details | Impact on Clustering |
|---------|---------|---------------------|
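| **Right-skewed monetaries** | Raw skewness of about 30 to 32 for `order_amount_brutto`, `avg_item_price`, `max_item_price`; after `log1p` it drops to about -2.5 | `log1p` needed, but the remaining left tail (minimum value is 0) should be checked; outliers may form singleton clusters |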
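| **Extreme outliers** | Max order ~21,220 PLN vs median ~157 PLN (P99 ≈ 884 PLN); the nine largest orders all use Hungary (Węgry) delivery types | May create singleton clusters; consider capping or separate treatment, and verify the currency of these orders |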
| **High cardinality** | `delivery_type` has 89 unique values, `payment_name` has 29 | 118 of the 126 feature columns are one-hot and dominate the feature space |
| **Rare categories** | 55 `delivery_type` and 14 `payment_name` categories appear < 10 times (2.1% and 0.2% of orders) | Grouping rare values into "Other" will reduce noise |
| **Low PCA variance** | 46 components are needed for 50% of variance and 104 for 95% | One-hot dimensions spread variance thinly |
| **Temporal patterns** | Clear hour/day patterns in ordering | sin/cos encoding keeps adjacent times (e.g. 23:00 and 00:00) close together |
| **Correlated features** | `avg_item_price` ↔ `max_item_price` highly correlated; only 8.0% of orders have more than one item | Possible redundancy; drop one or let PCA absorb it |

**Recommendations for Notebook 02:**  
1. Group rare categorical values (< N occurrences; Section 4 compares N = 10, 50, 100) into "Other"  
2. Experiment with numerical-only and numerical+cyclical feature sets  
3. Tune DBSCAN eps using k-distance plot  
4. Use t-SNE for better 2D separation visualization
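
For the DBSCAN recommendation, a k-distance curve can be sketched as follows. This is a generic illustration on synthetic blobs, not the procedure used in Notebook 02; `k_distances` is a hypothetical helper and `k = 5` is an assumed stand-in for `min_samples`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_distances(X: np.ndarray, k: int = 5) -> np.ndarray:
    """Sorted distance from each point to its k-th nearest neighbour."""
    # n_neighbors = k + 1 because each point is returned as its own neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])


rng = np.random.default_rng(0)
# Two dense blobs plus scattered noise points (synthetic)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(200, 2)),
    rng.normal(5.0, 0.3, size=(200, 2)),
    rng.uniform(-2.0, 7.0, size=(20, 2)),
])

kd = k_distances(X, k=5)
print(f"median k-distance: {np.median(kd):.3f}, max k-distance: {kd[-1]:.3f}")
```

Plotting `kd` against its index and reading off the point where the curve bends sharply upward gives a candidate range for `eps`; points beyond the bend are the ones DBSCAN would tend to label as noise.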