**DS 3010: Applied Data Modeling and Predictive Analysis**

# Lab 08 – K-means

**Instructions:**
1. Load the Wholesale customers dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv
2. Exclude the 'Channel' and 'Region' columns as they are categorical.
3. Preprocess by scaling the features.
4. Use a default KMeans model to fit the dataset, then use silhouette analysis to score the model.
5. Try different values of n_clusters to see how performance changes.
6. Select the best model based on the silhouette score (higher is better).

### Task 1. Load and preprocess the dataset

In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv"
data = pd.read_csv(url)
data = data.drop(['Channel', 'Region'], axis=1)

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
print(data_scaled)

[[ 0.05293319  0.52356777 -0.04111489 -0.58936716 -0.04356873 -0.06633906]
 [-0.39130197  0.54445767  0.17031835 -0.27013618  0.08640684  0.08915105]
 [-0.44702926  0.40853771 -0.0281571  -0.13753572  0.13323164  2.24329255]
 ...
 [ 0.20032554  1.31467078  2.34838631 -0.54337975  2.51121768  0.12145607]
 [-0.13538389 -0.51753572 -0.60251388 -0.41944059 -0.56977032  0.21304614]
 [-0.72930698 -0.5559243  -0.57322717 -0.62009417 -0.50488752 -0.52286938]]
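
To make the scaling step concrete, here is a minimal sketch of what standardization does, using a small made-up array (`toy` is hypothetical and not part of the Wholesale data). It assumes the usual z-score definition: subtract each column's mean and divide by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 rows, 2 features on very different scales.
toy = np.array([[100.0, 1.0],
                [200.0, 2.0],
                [300.0, 3.0],
                [400.0, 4.0]])

# Manual z-score: subtract the column mean, divide by the population std (ddof=0).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# StandardScaler applied to the same array for comparison.
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))               # do the two approaches agree?
print(scaled.mean(axis=0), scaled.std(axis=0))   # per-column mean and std after scaling
```

After scaling, every feature contributes on a comparable scale to the Euclidean distances that KMeans relies on, so no single high-spend column dominates the clustering.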


### Task 2. Apply KMeans model and use silhouette score to evaluate the model

In [2]:
kmeans = KMeans(random_state=42)
clusters = kmeans.fit_predict(data_scaled)
silhouette_avg = silhouette_score(data_scaled, clusters)
print(f"The silhouette score is {silhouette_avg}")

The silhouette score is 0.32366546538940916
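
Note that a default `KMeans()` uses `n_clusters=8`, so the score above describes an 8-cluster solution. The overall silhouette score is the mean of per-sample silhouette values, each in [-1, 1]. The sketch below illustrates this on synthetic data; `X` from `make_blobs` is an assumed stand-in for `data_scaled`, and setting `n_init=10` explicitly is an illustrative choice rather than something the lab requires.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for data_scaled (assumption: 3 compact blobs in 2-D).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(X)

per_sample = silhouette_samples(X, labels)  # one value per point
overall = silhouette_score(X, labels)       # mean of the per-sample values

# Per-cluster averages show which clusters are tight and which are not.
for k in np.unique(labels):
    print(f"cluster {k}: mean silhouette = {per_sample[labels == k].mean():.3f}")
print(f"overall: {overall:.3f}")
```

Looking at per-cluster averages can reveal a poorly separated cluster that the single overall number hides.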


### Task 3. Try different n_clusters and select the best one

In [3]:
best_score = float("-inf")  # silhouette scores can be negative, so do not start at 0.0
best_n_clusters = None

for n_clusters in range(2, 10):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(data_scaled)
    silhouette_avg = silhouette_score(data_scaled, clusters)
    print(f"The silhouette score is {silhouette_avg}")
    if silhouette_avg > best_score:
        best_score = silhouette_avg
        best_n_clusters = n_clusters
print(f"The best n clusters is {best_n_clusters}, the silhouette score is {best_score}")

The silhouette score is 0.3998278091730005
The silhouette score is 0.4582633767207058
The silhouette score is 0.34939129340421093
The silhouette score is 0.36890127429678055
The silhouette score is 0.2762464573058837
The silhouette score is 0.276678268663421
The silhouette score is 0.32366546538940916
The silhouette score is 0.29453704649783113
The best n clusters is 3, the silhouette score is 0.4582633767207058
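
As an alternative way to organize the search, the scores can be stored in a dictionary keyed by `n_clusters`, which keeps each score paired with its cluster count and needs no initial sentinel value. This is only a sketch on synthetic data: `X`, the four fixed blob centers, and the helper name `silhouette_by_k` are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for data_scaled (assumption: four blobs at fixed centers).
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=42)

def silhouette_by_k(data, k_values, seed=42):
    """Hypothetical helper: fit KMeans for each k and return {k: silhouette score}."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data)
        scores[k] = silhouette_score(data, labels)
    return scores

scores = silhouette_by_k(X, range(2, 10))
for k, score in scores.items():
    print(f"n_clusters={k}: silhouette score = {score:.4f}")

# Pick the k with the highest score.
best_k = max(scores, key=scores.get)
print(f"The best n clusters is {best_k}, the silhouette score is {scores[best_k]:.4f}")
```

Printing `n_clusters` next to each score also makes the output easier to read than a bare list of scores.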
