# Wholesale customers Data Set
Annual spending in monetary units of clients of a wholesale distributor
The dataset refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories
Source: UCI Wholesale customers Data Set

Data set link: https://www.kaggle.com/datasets/binovi/wholesale-customers-data-set


### Data Description:
**Channel**: Horeca (Hotel/Restaurant/Cafe or Retail channel (Nominal)

**Region** : Lisnon, Oporto or Other (Nominal)

**Fresh** : annual spending (m.u.) on fresh products (Continuous)

**Milk** : annual spending (m.u.) on milk products (Continuous)

**Grocery** : annual spending (m.u.)on grocery products (Continuous)

**Frozen** : annual spending (m.u.)on frozen products (Continuous)

**Detergents_Paper** : annual spending (m.u.) on detergents and paper products (Continuous)

**Delicassen** : annual spending (m.u.)on and delicatessen products (Continuous)

<br /><br />
Expected Output

By the end of this Mini Project, you should deliver within your code:

- Multiple Dunn Index measures resembling different k used for K-Means clustering your data.
- An output plot of the elbow curve.
- The best k chosen based on the elbow curve plot.
- Output predicted clusters for the first 10 data samples.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import optuna
from  joblib import dump, load
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score




In [2]:
df_raw = pd.read_csv("Data/Wholesale_Customers_Data.csv")
df_raw.head(10)

Unnamed: 0,Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
0,2,3,12669,9656,7561,214,2674,1338
1,2,3,7057,9810,9568,1762,3293,1776
2,2,3,6353,8808,7684,2405,3516,7844
3,1,3,13265,1196,4221,6404,507,1788
4,2,3,22615,5410,7198,3915,1777,5185
5,2,3,9413,8259,5126,666,1795,1451
6,2,3,12126,3199,6975,480,3140,545
7,2,3,7579,4956,9426,1669,3321,2566
8,1,3,5963,3648,6192,425,1716,750
9,2,3,6006,11093,18881,1159,7425,2098


### Look for NaN values and drop them

In [3]:
display(df_raw.info())
print("\n")
display(df_raw.isna().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440 entries, 0 to 439
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Channel           440 non-null    int64
 1   Region            440 non-null    int64
 2   Fresh             440 non-null    int64
 3   Milk              440 non-null    int64
 4   Grocery           440 non-null    int64
 5   Frozen            440 non-null    int64
 6   Detergents_Paper  440 non-null    int64
 7   Delicassen        440 non-null    int64
dtypes: int64(8)
memory usage: 27.6 KB


None





Channel             0
Region              0
Fresh               0
Milk                0
Grocery             0
Frozen              0
Detergents_Paper    0
Delicassen          0
dtype: int64

## Cluster the data without considering the Channel and Region 

In [4]:
X = df_raw.drop(columns=['Channel', 'Region']).values

### Define objective funcions for hyperparameter tuning

In [5]:
# Objective function for hyperparameter tuning of DecisionTree
def objective(trial, X, n_clusters, random_state):
    params = {        
    "init": trial.suggest_categorical("init", ["k-means++", "random"]),
    "tol": trial.suggest_float("tol", 1e-9, 1e9, log=True),
    "algorithm": trial.suggest_categorical("algorithm", ["lloyd", "elkan"]),
    "n_clusters": n_clusters,
    "n_init": "auto", 
    "max_iter": 1000,
    "random_state": random_state,
    }
    
    cluster_model = KMeans(**params).fit(X)
    cluster_labels  = cluster_model.labels_
    silhouette_avg = silhouette_score(X, cluster_labels)
    
    return silhouette_avg

In [7]:
random_state = 42
optuna.logging.set_verbosity(optuna.logging.INFO)
optuna.logging.set_verbosity(optuna.logging.WARNING)
best_score = 0
best_k = 2
best_trial = []

for k in range(2,21):
    study = optuna.create_study(direction = "maximize")
    func = lambda trial: objective(trial, X, k, random_state)
    study.optimize(func, n_trials = 100, timeout=600)

    silhouette_score_val = study.best_trial.value

    print("K = ", k)
    print("Silhouette Score = ", best_score)
    print("Params: ")
    for key, value in study.best_trial.params.items():
        print("    {}: {}".format(key, value))

    if silhouette_score_val > best_score:
        best_score = silhouette_score_val
        best_k = k
        best_trial = study.best_trial
        
        

K =  2
Silhouette Score =  0
Params: 
    init: random
    tol: 0.49434644084792484
    algorithm: lloyd
K =  3
Silhouette Score =  0.5557067561301402
Params: 
    init: random
    tol: 7.242727867367732e-05
    algorithm: lloyd
K =  4
Silhouette Score =  0.5557067561301402
Params: 
    init: random
    tol: 0.7557938871147519
    algorithm: elkan
K =  5
Silhouette Score =  0.5557067561301402
Params: 
    init: k-means++
    tol: 0.0006917783810538939
    algorithm: elkan
K =  6
Silhouette Score =  0.5557067561301402
Params: 
    init: random
    tol: 0.01183670806976693
    algorithm: elkan
K =  7
Silhouette Score =  0.5557067561301402
Params: 
    init: random
    tol: 4.504458716504776e-05
    algorithm: lloyd
K =  8
Silhouette Score =  0.5557067561301402
Params: 
    init: k-means++
    tol: 3.1831923805458303e-09
    algorithm: lloyd
K =  9
Silhouette Score =  0.5557067561301402
Params: 
    init: k-means++
    tol: 7.036436506134834e-09
    algorithm: elkan
K =  10
Silhouette Sco

In [8]:
print("Best K = ", best_k)
print("Best Silhouette Score = ", best_score)
print("Best Score Params: ")
for key, value in best_trial.params.items():
    print("    {}: {}".format(key, value))

Best K =  2
Best Silhouette Score =  0.5557067561301402
Best Score Params: 
    init: random
    tol: 0.49434644084792484
    algorithm: lloyd
