# Hierarchical clustering of product catalogs using size titles

The goal here is to cluster product catalogs into higher-level groups of catalogs. Vinted already organizes its catalogs in a natural hierarchy. For instance, the primary labels are: Women, Men, Children, House, Animals, Others

- Women -> Bras, dresses, leggings, pantyhose, as well as common clothes (trousers, shirts, shoes, etc.)
- Men -> Shoes, shorts, t shirts, shirts, etc
- Children -> same as above
- House -> blankets, towels, pillows, decoration, etc
- Animals -> Toys, grooming, training, etc
- Entertainment -> ...

In addition, these hierarchies are 4 to 5 layers deep.

- Men -> Shoes -> Sneakers -> Sports sneakers
- Men -> Clothing -> T shirts and tops -> T shirts -> Simple T shirts

## Product catalogs and size titles

After skimming through the data, I realized that the best segmentation feature here is the size title. Each category is often linked to a specific family of size titles. Sometimes the link is obvious (shoes vs. clothing), especially at the higher levels of segmentation; sometimes it is trickier (women's shoes vs. men's shoes).

As an example:
- M, XL, S, etc., which are the standard clothing sizes
    - But sizes differ between men and women
- no_size for products that do not have size labels
    - Catalogs with a no_size title cannot be told apart based on sizing alone
- numeric sizes (40, 41), which refer to footwear
- kids' sizes (8 anos, 12 anos, etc.)
- cup sizes (95D, 80B)
- jewelry (which I assume sizes such as "30 mm diameter" refer to)
- dimensions (40 x 45 cm)
- weight ranges (9-18kgs)
- jeans sizes (waist measurements such as W33)
    - But sizes differ between men and women
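The size families listed above can be sketched as a small rule-based classifier. This is only an illustration: the function name `size_family`, the family names, and the regular expressions are assumptions derived from the examples in the list, not an official Vinted taxonomy, and they are not used in the clustering below.

```python
import re

# Hypothetical size families; the patterns are illustrative guesses based on
# the example size titles above. Order matters: the first match wins.
SIZE_PATTERNS = [
    ("no_size", re.compile(r"^$")),
    ("kids_age", re.compile(r"\b(anos|meses)\b", re.IGNORECASE)),
    ("cup", re.compile(r"^\d{2,3}[A-H]$")),
    ("jeans_waist", re.compile(r"\bW\d{2}\b")),
    ("ring_diameter", re.compile(r"mm")),
    ("weight", re.compile(r"\d\s*kgs?\b", re.IGNORECASE)),
    ("dimensions", re.compile(r"\d+\s*x\s*\d+\s*cm", re.IGNORECASE)),
    ("letter", re.compile(r"^X{0,3}[SML]\b")),
    ("numeric", re.compile(r"^\d{2}$")),
]


def size_family(size_title: str) -> str:
    """Return the first matching size family, or "other" if none matches."""
    title = size_title.strip()
    for family, pattern in SIZE_PATTERNS:
        if pattern.search(title):
            return family
    return "other"


for example in ["", "47", "XL / 42 / 14", "12 anos / 152 cm", "95D", "PT 42 | W33"]:
    print(repr(example), "->", size_family(example))
```

A classifier like this could be used to sanity-check the clusters later on, for example by counting how many size families end up in each cluster.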



In [49]:
import pandas as pd
from sqlalchemy import create_engine
import os
import json
import matplotlib.pyplot as plt
#from sklearn.decomposition import PCA

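def load_credentials(path="aws_rds_credentials.json"):
    """Load database credentials from a JSON file into environment variables."""
    with open(path, "r") as file:
        config = json.load(file)

    # os.environ only accepts strings, so cast values such as a numeric port
    for key, value in config.items():
        os.environ[key] = str(value)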

time_interval = 90 #days

load_credentials()

aws_rds_url = f"postgresql://{os.environ['user']}:{os.environ['password']}@{os.environ['host']}:{os.environ['port']}/{os.environ['database']}?sslmode=require"

engine = create_engine(aws_rds_url)
sql_query = f"""SELECT brand_title, price_numeric, catalog_id, size_title
               FROM public.tracking_staging 
               WHERE date >= CURRENT_DATE - INTERVAL '{time_interval} days'
               """
data = pd.read_sql(sql_query, engine)
data

Unnamed: 0,brand_title,price_numeric,catalog_id,size_title
0,Amazon,9.0,2007.0,
1,Aliexpress,15.0,2662.0,47
2,NoName,60.0,1782.0,L / 40 / 12
3,NoName,60.0,1782.0,M / 38 / 10
4,sansnom.,60.0,1782.0,M / 38 / 10
...,...,...,...,...
720994,findci,12.0,1204.0,S
720995,Zalando,28.0,1056.0,XL / 42 / 14
720996,Amazon,10.0,1779.0,L / 40 / 12
720997,Shein,13.0,1825.0,XL
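One detail about the connection URL built in the cell above: if the password or user name contains characters such as `@` or `/`, interpolating them directly into the URL breaks parsing. A minimal sketch of a safer construction, assuming a hypothetical helper `build_postgres_url` and made-up credentials, is:

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Hypothetical helper: percent-encode credentials before building the URL."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}?sslmode=require"
    )


# Made-up values for illustration only
url = build_postgres_url("analyst", "p@ss/word", "db.example.com", 5432, "vinted")
print(url)
```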


### Dataset

The dataset used is the Vinted dataset. For this analysis, we select only the following variables.

| Variable           | Description       | Type                  |
|--------------------|-------------------|-----------------------|
| brand_title        | Brand             | Char                  |
| price_numeric      | Base price        | Numeric               |
| catalog_id         | Catalog           | Numeric (category)    |
| size_title         | Size              | Char                  |

In [50]:
data.describe()

Unnamed: 0,price_numeric,catalog_id
count,720975.0,720975.0
mean,17.315433,1348.03433
std,46.212524,861.074429
min,1.0,11.0
25%,4.0,529.0
50%,7.0,1542.0
75%,15.0,1846.0
max,3500.0,2970.0


In [51]:
data.isna().sum()

brand_title      24
price_numeric    24
catalog_id       24
size_title       24
dtype: int64

In [52]:
data[["brand_title", "size_title"]].nunique()

brand_title    9326
size_title      303
dtype: int64

### Data preparation

Before we start, we need to clean and reshape the data into a suitable format. We use only the sizes as clustering dimensions, because sizing plays a major role in how product catalogs are organized and reflects other latent variables such as gender, age and product type.


The most reasonable metric here is the count of products per size title, since price is subjective and affected by other factors. If we were to choose median price or volume instead, the metric would be more easily affected by exogenous variables.

In [53]:
data.dropna(axis = 0, inplace= True)
data

Unnamed: 0,brand_title,price_numeric,catalog_id,size_title
0,Amazon,9.0,2007.0,
1,Aliexpress,15.0,2662.0,47
2,NoName,60.0,1782.0,L / 40 / 12
3,NoName,60.0,1782.0,M / 38 / 10
4,sansnom.,60.0,1782.0,M / 38 / 10
...,...,...,...,...
720994,findci,12.0,1204.0,S
720995,Zalando,28.0,1056.0,XL / 42 / 14
720996,Amazon,10.0,1779.0,L / 40 / 12
720997,Shein,13.0,1825.0,XL


In [54]:
# using product count as metric
pivot_size = data.pivot_table(values='price_numeric', columns='size_title', index='catalog_id', aggfunc='count')
pivot_size

size_title,Unnamed: 1_level_0,0-13 kg,"0-3 meses, 30 cm","1-2 anos, 49 cm",1-3 meses / 56 cm,10,10 anos / 140 cm,100 cm,100 x 150 cm,100B,100E,105 cm,11 anos / 146 cm,11-25 kg,110 cm,115 cm,12,12 anos / 152 cm,12 ou mais > 56 cm,12-18 meses / 80 cm,120 cm,120 x 160 cm,125 cm,125 x 150 cm,13 anos / 158 cm,13-15,130 x 170 cm,14 anos / 164 cm,"14,1 mm Ø / 4","14,9 mm Ø / 6,5",15 anos / 170 cm,15 ou inferior,"15,3 mm Ø / 8","15,7 mm Ø / 9,5",15-36 kg,150 cm - 199 cm,150 x 200 cm,16,16 anos / 176 cm,"16,1 mm Ø / 10,5",...,PT 38 | W28,PT 38 | W29,PT 40 | W30,PT 40 | W31,PT 42 | W32,PT 42 | W33,PT 44 | W34,PT 44 | W35,PT 46 | W36,PT 48 | W38,PT 50 | W40,PT 52 | W42,PT 54 | W44,PT 56 | W46,PT 58 | W48,PT 60 | W50,PT 62 | W52,PT 64 | W54,"Prematuro, até 44 cm","Prematuros, 30 cm",Qualquer,"Recém-nascidos, 44 cm",S,S / 36 / 8,S | 35-38,S | 38-42,Solteiro (135-150 cm x 200-220 cm),Tamanho único,XL,XL / 42 / 14,XS,XS / 34 / 6,XS | 36-37,XXL,XXL / 44 / 16,XXS,XXS / 32 / 4,XXXL,XXXL / 46 / 18,XXXS / 30 / 2
catalog_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
11.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,659.0,,,,45.0,,119.0,,442.0,,,74.0,,21.0,,71.0,45.0
14.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,570.0,,,,7.0,,207.0,,391.0,,,53.0,,6.0,,10.0,
16.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
18.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,1131.0,,,,361.0,,316.0,,223.0,,,63.0,,9.0,,5.0,2.0
19.0,469.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2965.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.0,,,,,,,,,,,,
2967.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2968.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2969.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [55]:
from sklearn.preprocessing import MinMaxScaler

# for each catalog, rescale the size counts to [0, 1] with MinMaxScaler
# so that catalogs with very different numbers of products are comparable

pivot_combined = pivot_size.fillna(0)
pivot_combined = pivot_combined.T
for col in pivot_combined.columns:
    pivot_combined[col] = MinMaxScaler().fit_transform(X = pivot_combined[[col]]) #/pivot_combined[col].sum()
pivot_combined 

catalog_id,11.0,14.0,16.0,18.0,19.0,20.0,22.0,26.0,83.0,84.0,86.0,87.0,88.0,89.0,90.0,91.0,92.0,94.0,96.0,97.0,98.0,99.0,119.0,120.0,123.0,124.0,140.0,141.0,143.0,145.0,152.0,153.0,156.0,157.0,158.0,159.0,160.0,161.0,162.0,163.0,...,2920.0,2921.0,2922.0,2923.0,2924.0,2925.0,2927.0,2928.0,2929.0,2931.0,2932.0,2933.0,2934.0,2937.0,2938.0,2939.0,2940.0,2941.0,2942.0,2944.0,2945.0,2949.0,2950.0,2951.0,2952.0,2953.0,2954.0,2955.0,2956.0,2958.0,2959.0,2960.0,2961.0,2962.0,2964.0,2965.0,2967.0,2968.0,2969.0,2970.0
size_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
,0.000000,0.000000,0.0,0.000000,1.0,0.551724,0.226744,1.0,0.000000,0.0,0.237013,1.0,0.732719,1.0,0.089888,0.0,0.000000,1.0,0.11194,0.593272,1.0,1.0,0.0,0.000000,0.000000,0.000000,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,0.000000,0.000000,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
0-13 kg,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.00000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"0-3 meses, 30 cm",0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.00000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"1-2 anos, 49 cm",0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.00000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1-3 meses / 56 cm,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.00000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
XXS,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.00000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
XXS / 32 / 4,0.031866,0.010526,0.0,0.007958,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.00000,0.000000,0.0,0.0,0.0,0.000000,0.015322,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.001832,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
XXXL,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.076389,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.377778,0.0,0.00000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
XXXL / 46 / 18,0.107739,0.017544,0.0,0.004421,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.00000,0.000000,0.0,0.0,0.0,0.007092,0.010215,0.181818,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.078755,0.653846,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
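The per-catalog scaling in the cell above can also be written without a Python loop. The sketch below uses a made-up toy count matrix and plain pandas arithmetic; replacing a zero range with 1 is an assumption intended to mirror how `MinMaxScaler` treats constant columns (they end up as all zeros).

```python
import pandas as pd

# Toy count matrix: rows are size titles, columns are catalog ids (made-up values)
counts = pd.DataFrame(
    {11: [0, 10, 5], 19: [4, 0, 0], 16: [0, 0, 0]},
    index=["", "M", "S"],
)

col_min = counts.min(axis=0)
# Replace a zero range with 1 to avoid 0/0 for constant columns
col_range = (counts.max(axis=0) - col_min).replace(0, 1)
scaled = (counts - col_min) / col_range
print(scaled)
```

Each column is scaled independently, so the most frequent size title of every catalog maps to 1 regardless of how many products the catalog has.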


## Agglomerative Hierarchical Clustering using Ward Linkage

As the term hierarchical indicates, the method builds clusters as a hierarchy. Generally, there are two types of clustering strategies: **Agglomerative** and **Divisive**. Here, we mainly focus on the agglomerative approach, which can be easily pictured as a ‘bottom-up’ algorithm: every point starts as its own cluster, and the closest clusters are merged step by step.
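To make the bottom-up idea concrete, here is a minimal sketch on four made-up 2D points using SciPy's `linkage`. Each row of the returned matrix describes one merge: the two cluster indices being merged, the merge distance, and the size of the new cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four toy points forming two obvious pairs (made-up data)
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

Z = linkage(points, method="ward")
for left, right, dist, size in Z:
    print(f"merge {int(left)} + {int(right)} at distance {dist:.3f} -> cluster of {int(size)}")
```

With n points there are always n - 1 merges, and the last one produces a single cluster containing every point.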

### Ward Linkage Method

There are four commonly used methods for combining clusters in the agglomerative approach (single, complete, average and Ward linkage). The one we use is Ward’s method. Unlike the others, it does not measure the distance between clusters directly; instead, it analyzes the variance of the clusters. Ward’s is said to be the most suitable method for quantitative variables.
    
$\Delta(A,B) = \sum_{i \in A \cup B} \lVert \overrightarrow{x_i} - \overrightarrow{m}_{A \cup B} \rVert^2 - \sum_{i \in A} \lVert \overrightarrow{x_i} - \overrightarrow{m}_A \rVert^2 - \sum_{i \in B} \lVert \overrightarrow{x_i} - \overrightarrow{m}_B \rVert^2 = \frac{n_A n_B}{n_A + n_B} \lVert \overrightarrow{m}_A - \overrightarrow{m}_B \rVert^2$

where $\overrightarrow{m}_j$ is the center of cluster j, and $n_j$ is the number of points in it. Δ is called the merging cost of combining the clusters A and B. With hierarchical clustering, the sum of squares starts out at zero (because every point is in its own cluster) and then grows as we merge clusters. Ward’s method keeps this growth as small as possible.
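The two forms of the merging cost can be checked numerically. The sketch below uses random toy clusters `A` and `B` (made-up data) and a small helper `sse`, both introduced only for illustration; it compares the difference of sums of squares with the closed form based on the cluster centers.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))            # toy cluster A: 5 points in 3 dimensions
B = rng.normal(loc=2.0, size=(7, 3))   # toy cluster B: 7 points in 3 dimensions


def sse(X):
    """Sum of squared distances of the rows of X to their mean."""
    return float(((X - X.mean(axis=0)) ** 2).sum())


# Direct form: SSE of the merged cluster minus the SSE of each part
delta_direct = sse(np.vstack([A, B])) - sse(A) - sse(B)

# Closed form: weighted squared distance between the two cluster centers
n_a, n_b = len(A), len(B)
center_gap = float(((A.mean(axis=0) - B.mean(axis=0)) ** 2).sum())
delta_closed = n_a * n_b / (n_a + n_b) * center_gap

print(delta_direct, delta_closed)
```

The closed form is what makes Ward's method cheap to evaluate: only the cluster sizes and centers are needed, not the individual points.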

The norm in the merging cost is the Euclidean distance, which is the straight-line distance between two points in Euclidean space.

$d(p,q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$
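As a quick worked example of the formula, the sketch below computes the distance between two made-up points both by hand and with the standard library function `math.dist`.

```python
import math

p = (1.0, 2.0, 3.0)
q = (4.0, 6.0, 3.0)

# Term-by-term version of the formula: sqrt(3^2 + 4^2 + 0^2)
manual = math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))
builtin = math.dist(p, q)

print(manual, builtin)
```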

In [56]:
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt
import numpy as np
import plotly.graph_objects as go

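# within-cluster sum of squared errors (SSE): squared Euclidean distance of each
# catalog in the cluster to the cluster centroid, summed over all catalogs
def intracluster_sse(df: pd.DataFrame) -> float:
    avg = df.mean(axis=1)  # centroid: mean scaled count per size title
    sse = (df.T - avg).pow(2).sum().sum()
    return sse

# rows of pivot_combined.T are catalogs, columns are size titles
linkage_matrix = linkage(pivot_combined.T, 'ward')
t_values = np.arange(1, 10.1, 0.1)  # candidate distance thresholds
avg_sse = []
clusters = []
for t in t_values:
    # cut the dendrogram at distance t and collect the catalogs of each cluster
    res = fcluster(linkage_matrix, criterion = "distance", t = t)
    df = pd.DataFrame(res, index = pivot_combined.columns).reset_index()
    grouped_data = df.groupby(0)['catalog_id'].apply(list).reset_index()
    sse_list = []
    clusters.append(len(grouped_data))
    for catalog in grouped_data["catalog_id"]:
        sse_list.append(intracluster_sse(pivot_combined[catalog]))
    # note: despite the variable name, this is the SSE summed over all clusters
    avg_sse.append(sum(sse_list))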

fig = go.Figure()
fig.add_trace(go.Scatter(x=t_values, y=clusters, mode='lines+markers'))
fig.update_xaxes(title_text="Distance threshold (t)")
fig.update_yaxes(title_text="Number of clusters")
fig.update_layout(title="Elbow Method Chart - Clusters")

In [57]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=t_values, y=avg_sse, mode='lines+markers'))
fig.update_xaxes(title_text="Distance threshold (t)")
fig.update_yaxes(title_text="Total intra-cluster SSE")
fig.update_layout(title="Elbow Method Chart - Total SSE")
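Instead of scanning distance thresholds, the dendrogram can also be cut at a fixed number of clusters. The sketch below is an alternative, not what this notebook does: it uses three made-up, well-separated blobs and `fcluster` with `criterion="maxclust"`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Three toy blobs of 10 points each, centered at (0, 0), (3, 3) and (6, 6)
rng = np.random.default_rng(42)
toy = np.vstack([rng.normal(loc=c, scale=0.1, size=(10, 2)) for c in (0.0, 3.0, 6.0)])

Z = linkage(toy, method="ward")
# Ask for at most 3 flat clusters instead of specifying a distance threshold
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

This can be convenient when the desired number of groups is known in advance, while the distance criterion used above lets the data decide how many clusters appear.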

In [73]:
t = 2.5

res = fcluster(linkage_matrix, criterion = "distance", t = t)
df = pd.DataFrame(res, index = pivot_combined.columns).reset_index()
print(df)

     catalog_id   0
0          19.0   3
1          26.0  22
2          87.0   7
3          89.0   3
4          94.0   3
..          ...  ..
295      2959.0   3
296      2960.0  16
297      2961.0  16
298      2962.0  16
299      2967.0   3

[300 rows x 2 columns]


### Analysis of Sizing Features

To understand how the clustering was done, we look at the top sizes of each cluster as well as the size of each cluster (its number of catalogs).

The result has 25 clusters, and the largest cluster contains 135 labels.
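Summary figures like these can be derived directly from the flat cluster labels. A minimal sketch with made-up labels (not the notebook's actual output) is:

```python
from collections import Counter

# Toy cluster labels, one per catalog (made-up values)
labels = [3, 22, 7, 3, 3, 16, 16, 16, 3]

sizes = Counter(labels)
n_clusters = len(sizes)
largest_label, largest_size = sizes.most_common(1)[0]

print(f"{n_clusters} clusters; largest is cluster {largest_label} with {largest_size} catalogs")
```

In the notebook, the same logic could be applied to the array returned by `fcluster`.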

In [59]:
grouped_data = df.groupby(0)['catalog_id'].apply(list).reset_index()
grouped_data['catalog_id_length'] = grouped_data['catalog_id'].apply(len)
grouped_data

Unnamed: 0,0,catalog_id,catalog_id_length
0,1,"[19.0, 26.0, 87.0, 89.0, 94.0, 98.0, 99.0, 140...",300
1,2,"[120.0, 124.0, 229.0, 1100.0, 1103.0, 1178.0, ...",10
2,3,"[1030.0, 1067.0, 1126.0, 1263.0, 1445.0, 1773....",9
3,4,"[123.0, 192.0, 193.0, 195.0, 203.0, 220.0, 526...",22
4,5,"[11.0, 14.0, 18.0, 176.0, 178.0, 179.0, 184.0,...",98
5,6,"[259.0, 261.0, 263.0, 1787.0, 1816.0, 1817.0, ...",11
6,7,"[1790.0, 1830.0, 2121.0, 2124.0, 2552.0, 2912....",7
7,8,"[1206.0, 1474.0, 1617.0, 1618.0, 1829.0, 1861....",7
8,9,"[92.0, 260.0, 264.0, 586.0, 1224.0, 1226.0, 12...",22
9,10,"[83.0, 84.0, 265.0, 266.0, 267.0, 268.0, 271.0...",45


In [60]:
for catalog in grouped_data["catalog_id"]:
    top_10 = pivot_combined[catalog].sum(axis=1).sort_values(ascending=False).head(10)
    
    # Bar chart of the 10 size titles with the largest summed scaled counts
    fig = go.Figure([go.Bar(x=top_10.index, y=top_10.values)])
    fig.update_layout(title=f"Top 10 size titles for catalogs {catalog}",
                      xaxis_title="Size title",
                      yaxis_title="Sum of scaled counts",
                      )
    fig.show()

### Visualizing the clustering on 2 dimensions using PCA

In [83]:
from sklearn.decomposition import PCA
import pandas as pd

# pivot_combined.T has one row per catalog and one column per size title

# Instantiate PCA with 2 components
pca = PCA(n_components=2)

# Fit and transform the data
pca_result = pca.fit_transform(pivot_combined.T)

# Convert the result to a DataFrame
pca_df = pd.DataFrame(data=pca_result, columns=['PC1', 'PC2'], index = pivot_combined.T.index)
concat_df = pd.concat([pca_df, df.set_index("catalog_id")], ignore_index= False, axis=1)
concat_df = concat_df.rename(columns={0: 'Cluster'})
concat_df


In [85]:
import plotly.express as px

# Get the percentage variance explained by each principal component
explained_var_ratio = pca.explained_variance_ratio_

# Create a scatter plot with Plotly
fig = px.scatter(concat_df, x='PC1', y='PC2', color='Cluster', title='PCA Scatter Plot')

# Add the percentage variance explained to the axis labels
fig.update_layout(
    xaxis_title=f'PC1 ({explained_var_ratio[0]*100:.2f}% Variance Explained)',
    yaxis_title=f'PC2 ({explained_var_ratio[1]*100:.2f}% Variance Explained)'
)
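A 2D projection is only a faithful picture if the first two components carry most of the variance. The sketch below, on random toy data rather than the catalog matrix, shows one way to check this with the cumulative explained variance; the 90% target is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random toy data: 50 samples with 8 features (made-up values)
rng = np.random.default_rng(0)
X = rng.random((50, 8))

pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 90%
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3), n_for_90)
```

If the first two entries of the cumulative array are small, clusters that overlap in the scatter plot may still be well separated in the full size-title space.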