The steps for k-means clustering are:
1. Choose the number of clusters: Decide how many clusters to create, that is, the value of k. This can be an initial guess, or it can come from a method such as the Elbow method or the Silhouette score.
2. Initialize centroids: Randomly select k points from the dataset to be the initial centroids, one per cluster.
3. Calculate distances: Calculate the distance between each data point and each centroid.
4. Assign observations: Assign each data point to the closest centroid.
5. Update centroids: Find the new location of each centroid by taking the mean of all the observations in that cluster.
6. Repeat: Repeat steps 3–5 until the centroids no longer change position.
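
The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are made up for this example, and k is assumed to be chosen already.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative k-means loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids: pick k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Calculate distances: Euclidean distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Assign observations: each point goes to its closest centroid
        labels = distances.argmin(axis=1)
        # Update centroids: mean of the points in each cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Repeat until the centroids no longer change position
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated blobs, invented for illustration
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans_sketch(X_toy, k=2)
print(centroids)
```

In practice `sklearn.cluster.KMeans`, used in the rest of this notebook, is the better choice: it adds k-means++ initialization and multiple restarts.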


The output of k-means cluster analysis includes:
- A table of the mean values of each cluster on the clustering variables
- Which object has been classified into which cluster
- Plots and diagnostics to assess variation within and between clusters


K-means clustering is used in a variety of applications, including document classification, image segmentation, and recommendation engines.

In [None]:
# importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans

In [None]:
# reading the data and looking at the first five rows of the data
data=pd.read_csv(r"whole_sale_customers_data.csv")
data.head()

In [None]:
data['Region'].nunique()

In [None]:
# statistics of the data
data.describe()

In [None]:
# standardizing the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# statistics of scaled data
pd.DataFrame(data_scaled).describe()

In [None]:
# creating the KMeans estimator with k-means++ initialization
kmeans = KMeans(n_clusters=2, init='k-means++')

# fitting the k means algorithm on scaled data
kmeans.fit(data_scaled)

In the k-means clustering algorithm, inertia is a metric that measures how well a dataset has been clustered. It is calculated by measuring the distance between each data point and the centroid of its assigned cluster, squaring that distance, and summing the squares over all data points.
A lower inertia value indicates that the data points within the cluster are more compact or similar, and that the clusters are well-formed. A higher inertia value indicates that the data points are spread out and far from their centroids, and that the clustering is less optimal. 
The k-means algorithm aims to choose centroids that minimize inertia. A good model has low inertia and a low number of clusters (K), but there is a tradeoff because as K increases, inertia decreases. To find the optimal K for a dataset, you can use the Elbow method, which identifies the point where the decrease in inertia begins to slow. 
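
As a quick check of this definition, the sketch below recomputes inertia by hand and compares it with the `inertia_` attribute. The synthetic data from `make_blobs` and the names `X_demo`, `km`, and `manual_inertia` are assumptions made for illustration; they are not part of the wholesale customers dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, used only to illustrate the calculation
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X_demo - assigned_centroids) ** 2)

print(manual_inertia, km.inertia_)
```

The two printed values should agree up to floating-point error.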


In [None]:
# inertia on the fitted data
kmeans.inertia_

We got an inertia value of almost 2600. Now, let’s see how we can use the elbow curve to determine the optimum number of clusters in Python.

We will first fit multiple k-means models and in each successive model, we will increase the number of clusters. We will store the inertia value of each model and then plot it to visualize the result:

In [None]:
# fitting multiple k-means algorithms and storing the values in an empty list
SSE = []
for cluster in range(1,20):
    kmeans = KMeans(n_clusters=cluster, init='k-means++')
    kmeans.fit(data_scaled)
    SSE.append(kmeans.inertia_)

# converting the results into a dataframe and plotting them
frame = pd.DataFrame({'Cluster':range(1,20), 'SSE':SSE})
plt.figure(figsize=(12,6))
plt.plot(frame['Cluster'], frame['SSE'], marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

In [None]:
# k-means using 7 clusters and k-means++ initialization
kmeans = KMeans(n_clusters = 7, init='k-means++')
kmeans.fit(data_scaled)
pred = kmeans.predict(data_scaled)

In [None]:
kmeans.inertia_

In [None]:
frame = pd.DataFrame(data_scaled)
frame['cluster'] = pred
frame['cluster'].value_counts()

So, there are 234 data points belonging to cluster 4 (index 3), then 125 points in cluster 2 (index 1), and so on.

### Silhouette Score

To determine the optimal number of clusters, you can calculate the silhouette score for different values of k and choose the one with the highest average silhouette score. You can also visualize the silhouette analysis using a silhouette plot, which can help identify clusters with low silhouette scores.
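
A minimal sketch of that search is shown here on synthetic data, so that it runs on its own. The `make_blobs` data, the range of k, and the names `X_demo`, `scores`, and `best_k` are assumptions for illustration; the same loop could be applied to `data_scaled`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data, used only for illustration
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# The silhouette score needs at least 2 clusters, so start at k=2
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Keep the k with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```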

In [None]:
from sklearn.metrics import silhouette_score

In [None]:
silhouette_score(data_scaled, kmeans.labels_)

### K-Means for Image Segmentation

In [None]:
img = r"ladybug.png"

In [None]:
from matplotlib.image import imread
image = imread(img)
image.shape

In [None]:
plt.imshow(image)
plt.show()

In [None]:
X = image.reshape(-1, 3)
kmeans = KMeans(n_clusters=8, random_state=42).fit(X)
segmented_img = kmeans.cluster_centers_
len(kmeans.labels_)

In [None]:
kmeans.labels_

In [None]:
segmented_img = kmeans.cluster_centers_[kmeans.labels_]
segmented_img = segmented_img.reshape(image.shape)

In [None]:
segmented_imgs = []
n_colors = (10, 8, 6, 4, 2)
for n_clusters in n_colors:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
    segmented_img = kmeans.cluster_centers_[kmeans.labels_]
    segmented_imgs.append(segmented_img.reshape(image.shape))

In [None]:
len(kmeans.labels_)

In [None]:
plt.figure(figsize=(10,5))
plt.subplots_adjust(wspace=0.05, hspace=0.1)

plt.subplot(231)
plt.imshow(image)
plt.title("Original image")
plt.axis('off')

for idx, n_clusters in enumerate(n_colors):
    plt.subplot(232 + idx)
    plt.imshow(segmented_imgs[idx])
    plt.title("{} colors".format(n_clusters))
    plt.axis('off')


plt.show()

You can experiment with various numbers of clusters, as shown in the figure. When you use fewer than 8 clusters, notice that the ladybug’s flashy red color fails to get a cluster of its own: it gets merged with colors from the environment. This is because the ladybug is quite small, much smaller than the rest of the image, so even though its color is flashy, K-Means fails to dedicate a cluster to it: as mentioned earlier, K-Means prefers clusters of similar sizes.

### Using Clustering for Pre-processing

In [None]:
from sklearn.datasets import load_digits

In [None]:
X_digits, y_digits = load_digits(return_X_y=True)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
log_reg = LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000, random_state=42)
log_reg.fit(X_train, y_train)

In [None]:
log_reg_score = log_reg.score(X_test, y_test)
log_reg_score

Okay, that's our baseline: 96.89% accuracy. Let's see if we can do better by using K-Means as a preprocessing step. We will create a pipeline that will first cluster the training set into 50 clusters and replace the images with their distances to the 50 clusters, then apply a logistic regression model:

In [None]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=50, random_state=42)),
    ("log_reg", LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000, random_state=42)),
])
pipeline.fit(X_train, y_train)

In [None]:
pipeline_score = pipeline.score(X_test, y_test)
pipeline_score
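
Inside the pipeline, the "replace the images with their distances" step relies on `KMeans.transform`, which returns the distance from each sample to every centroid. The sketch below shows that transformation on its own. The names `km` and `X_dist` and the choice of `n_init=10` are assumptions for illustration, and it fits on the full digits set rather than the training split for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X_digits, _ = load_digits(return_X_y=True)
km = KMeans(n_clusters=50, n_init=10, random_state=42).fit(X_digits)

# transform() maps each 64-pixel image to its distances from the 50 centroids
X_dist = km.transform(X_digits)
print(X_digits.shape, X_dist.shape)

# The closest centroid in the transformed space should match the predicted label
match_rate = np.mean(X_dist.argmin(axis=1) == km.predict(X_digits))
print(match_rate)
```

The logistic regression in the pipeline is therefore trained on 50 distance features instead of the 64 raw pixel values.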