# KMeans Clustering

### Imports

In [1]:
import pandas as pd
import duckdb
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
import numpy as np

## Clustering Data Preparation

### Dataset Configuration
- **Source**: Unified dataset containing joined review/metadata  
- **Scope**: Products with valid ratings (1-5 stars) and a non-null product ID, brand, and main category
- **Structure**: Aggregated at product level (`asin`)  

---

### Feature Engineering Query
```sql
SELECT
    asin,                   -- Product ID
    main_category,          -- Top-level category
    brand,                  -- Manufacturer
    AVG(rating) AS mean_rating,  -- Average product rating
    COUNT(*) AS total_reviews   -- Review volume
FROM read_parquet('{DATASET_PATH}')
WHERE 
    rating BETWEEN 1 AND 5  -- Valid Rating
    AND asin IS NOT NULL    -- Ensure product ID exists
    AND main_category IS NOT NULL -- Ensure category exists
    AND brand IS NOT NULL         -- Ensure brand exists
GROUP BY asin, main_category, brand
```

In [None]:
DATASET_PATH = "G:/My Drive/unified_dataset/**/*.parquet"

query = f"""
    SELECT
        asin,
        main_category,
        brand,
        AVG(rating) AS mean_rating,
        COUNT(*) AS total_reviews
    FROM read_parquet('{DATASET_PATH}')
    WHERE rating BETWEEN 1 AND 5
      AND asin IS NOT NULL
      AND main_category IS NOT NULL
      AND brand IS NOT NULL
    GROUP BY asin, main_category, brand
"""

con = duckdb.connect()
df = con.execute(query).fetch_df()
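
To make the aggregation concrete, here is a minimal sketch of the same filter and grouping in pandas, applied to a tiny made-up frame. The `reviews` data and the `toy_products` name are hypothetical and exist only for illustration; the notebook itself runs the query through DuckDB against the Parquet files.

```python
import pandas as pd

# Hypothetical toy reviews; the real data comes from the Parquet files.
reviews = pd.DataFrame({
    "asin": ["A1", "A1", "A2", "A2", None, "A3"],
    "main_category": ["Books", "Books", "Toys", "Toys", "Toys", None],
    "brand": ["Acme", "Acme", "Zed", "Zed", "Zed", "Acme"],
    "rating": [5, 3, 4, 0, 5, 2],
})

# Same conditions as the WHERE clause: valid rating and non-null keys.
valid = reviews[
    reviews["rating"].between(1, 5)
    & reviews["asin"].notna()
    & reviews["main_category"].notna()
    & reviews["brand"].notna()
]

# Same grouping as GROUP BY asin, main_category, brand.
toy_products = (
    valid.groupby(["asin", "main_category", "brand"], as_index=False)
    .agg(mean_rating=("rating", "mean"), total_reviews=("rating", "size"))
)
print(toy_products)
```

Each surviving row is one product: the out-of-range rating and the rows with a missing `asin` or `main_category` are dropped before the averages and counts are computed.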


## 🧮 K-means Clustering Implementation

### Feature Preprocessing
**Categorical Encoding**:
```python
# Convert brand names to numerical IDs
df['brand_id'] = LabelEncoder().fit_transform(df['brand'])

# Convert categories to numerical IDs
df['category_id'] = LabelEncoder().fit_transform(df['main_category'])
```

In [4]:
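# Encode the categorical columns as integer IDs
df['brand_id'] = LabelEncoder().fit_transform(df['brand'])
df['category_id'] = LabelEncoder().fit_transform(df['main_category'])

# Feature matrix: two numeric aggregates plus the two encoded IDs
features = df[['mean_rating', 'total_reviews', 'brand_id', 'category_id']]

# Fit KMeans with 5 clusters; random_state makes the run reproducible
kmeans = KMeans(n_clusters=5, random_state=42)
df['cluster'] = kmeans.fit_predict(features)

# Per-cluster size and feature means
cluster_summary = df.groupby('cluster').agg({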
    'asin': 'count',
    'mean_rating': 'mean',
    'total_reviews': 'mean',
    'brand_id': 'mean',
    'category_id': 'mean'
}).rename(columns={'asin': 'cluster_size'})

print(cluster_summary)

         cluster_size  mean_rating  total_reviews      brand_id  category_id
cluster                                                                     
0             6813422     4.164541      10.256586  2.560370e+06    14.231206
1            10068099     4.141365      10.830879  4.734996e+05    13.541924
2            10250818     4.090625      10.565454  4.429793e+06    15.515071
3             9691205     4.129450      10.246752  1.503145e+06    14.733325
4             9494661     4.137608      11.019385  3.466730e+06    14.334762
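
The features passed to KMeans span very different ranges: `brand_id` runs into the millions, while `mean_rating` stays between 1 and 5. Because KMeans uses Euclidean distance, the widest feature can dominate the assignment. The sketch below illustrates this on synthetic data and shows one common remedy, standardizing with `StandardScaler`. This step is not part of the run above; the data, the `agreement` helper, and the choice of two clusters are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300  # points per synthetic group (hypothetical data, not the review dataset)

# Two groups that differ in rating and review volume ...
mean_rating = np.concatenate([rng.normal(2.0, 0.1, n), rng.normal(4.5, 0.1, n)])
total_reviews = np.concatenate([rng.normal(5, 1, n), rng.normal(50, 1, n)])
# ... plus an ID-like column with a huge range that is unrelated to the groups.
brand_id = rng.integers(0, 5_000_000, size=2 * n).astype(float)

X = np.column_stack([mean_rating, total_reviews, brand_id])
true_group = np.repeat([0, 1], n)


def agreement(labels, truth):
    """Fraction of points matching the true groups, ignoring label order."""
    match = np.mean(labels == truth)
    return max(match, 1 - match)


# Cluster the raw features, then the standardized features.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
X_scaled = StandardScaler().fit_transform(X)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)

raw_agreement = agreement(raw_labels, true_group)
scaled_agreement = agreement(scaled_labels, true_group)
print(f"raw: {raw_agreement:.2f}, scaled: {scaled_agreement:.2f}")
```

On the raw features the split follows the ID-like column, so agreement with the true groups stays near chance. After standardization the rating and review-volume columns contribute on a comparable scale.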


### Observations
1. **Relatively balanced cluster sizes** (about 6.8M to 10.3M products per cluster)
2. **Rating stability**
   * All clusters have a high mean rating (4.09 to 4.16)
   * The largest gap between cluster means is about 0.07
3. **Brand and category**
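   * Clusters 0, 3, and 4 share a similar average `category_id` (14.2 to 14.7) but differ in average `brand_id`. This may be because they group items from similar categories but separate them by brand. Since `brand_id` is a `LabelEncoder` ID assigned in sorted order of brand name, the brand average indicates a position in that ordering and does not by itself measure brand prestige.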
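
The brand averages in the summary are averages of `LabelEncoder` IDs, so it helps to see how those IDs are assigned. A small sketch with made-up brand names (the `brands` list is hypothetical):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical brand names, in arbitrary order with one repeat.
brands = ["Sony", "Apple", "Zebra", "Apple"]

encoder = LabelEncoder()
brand_ids = encoder.fit_transform(brands)

# classes_ holds the unique labels in sorted order; each ID is an index into it.
print(list(encoder.classes_))
print(list(brand_ids))
```

The IDs follow the sorted order of the names, so two brands with nearby IDs are close alphabetically and not necessarily similar in any other respect.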