# K-means

## Definition

K-means is a popular clustering algorithm used in data analysis and machine learning. It aims to partition a set of observations into $ K $ clusters, with each observation belonging to the cluster with the nearest mean.
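Equivalently, K-means can be read as minimizing the within-cluster sum of squared distances. Writing $ S_i $ for the set of points in the $ i $-th cluster and $ \mu_i $ for its centroid (the notation used in the Algorithm section), the objective is commonly stated as:

$$ J = \sum_{i=1}^{K} \sum_{x \in S_i} \| x - \mu_i \|^2 $$

The symbol $ J $ is introduced here only for illustration; the rest of these notes do not depend on it.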

## Assumptions

1. **Spherical Clusters**: K-means implicitly assumes that clusters are roughly spherical (isotropic), meaning that each cluster's variance is about the same in every direction. Because points are assigned by Euclidean distance to the nearest centroid, the boundary between two neighboring clusters is a flat hyperplane midway between their centroids.

2. **Similar Variance**: Each cluster is assumed to have roughly equal variance, meaning the spread of clusters is similar. If the variance is significantly different, K-means may struggle to identify the actual clusters correctly.

3. **Clusters are Separable and Non-hierarchical**: The algorithm works best when the clusters are well separated. It produces a flat partition and does not model nesting or any other relationship between clusters.

4. **Centroid Represents the Mean**: Each cluster is summarized by its centroid, the mean of its points. This is a good summary only when the mean is representative of the cluster, which breaks down for strongly skewed clusters or clusters containing outliers.

5. **Number of Clusters ($ K $)**: One of the main limitations of K-means is that you need to specify the number of clusters ($ K $) beforehand.

## Algorithm

### K-means Algorithm with K-means++ Initialization

#### Step 1: Initialization with K-means++

1. **Select the First Centroid**: Randomly pick the first centroid $ \mu_1 $ from the data points.

2. **Select Subsequent Centroids**: For each next centroid $ \mu_k $ (where $ k = 2, 3, \ldots, K $):
   
   a. Calculate $ D(x) $ for each data point $ x $, the distance from $ x $ to the nearest of the already chosen centroids:
      $$ D(x) = \min_{1 \leq i \leq k-1} \| x - \mu_i \| $$
   
   b. Choose the next centroid $ \mu_k $ randomly from the data points, where the probability of choosing point $ x $ is proportional to $ D(x)^2 $, that is, $ P(x) = D(x)^2 / \sum_{x'} D(x')^2 $.

   c. Repeat until all $ K $ centroids are chosen.

- **K-means++ Initialization**: This approach aims to spread out the initial centroids, leading to better and more reliable clustering.
- **Iterative Optimization**: The algorithm refines the clusters iteratively to minimize within-cluster variances.
- **Convergence**: The algorithm typically converges to a solution that, while not necessarily globally optimal, is often a good approximation for practical purposes.
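The seeding procedure in Step 1 can be sketched in NumPy as follows. This is a minimal illustration, not the `kmeans_plusplus` function used later in these notes: the name `kmeans_pp_init` and the `rng` parameter are assumptions made for the example. It samples each new centroid with probability proportional to the squared distance to the nearest centroid chosen so far.

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """Pick k initial centroids from the rows of X (hypothetical helper)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # First centroid: a data point chosen uniformly at random.
    centroids = [X[rng.integers(n)]]
    # d2[p] holds the squared distance from X[p] to its nearest chosen centroid.
    d2 = np.sum((X - centroids[0]) ** 2, axis=1)
    for _ in range(1, k):
        total = d2.sum()
        if total == 0:
            # Every point coincides with a chosen centroid; fall back to uniform.
            idx = rng.integers(n)
        else:
            # Sample an index with probability proportional to d2.
            idx = rng.choice(n, p=d2 / total)
        centroids.append(X[idx])
        # Only the newly added centroid can shrink a point's nearest distance.
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centroids)
```

A point that has already been chosen has squared distance zero, so it cannot be drawn again unless all remaining distances are zero too.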

#### Step 2: Assignment of Data Points to Clusters

Once the centroids are initialized, proceed with the iterative K-means process:

1. **Assignment Step**: Each data point $ x_p $ is assigned to the cluster whose centroid is nearest in Euclidean distance. The assignment can be expressed as:
   $$ 
   S_i^{(t)} = \{ x_p : \| x_p - \mu_i^{(t)} \|^2 \leq \| x_p - \mu_j^{(t)} \|^2 \forall j, 1 \leq j \leq K \}
   $$
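   Here, $ S_i^{(t)} $ represents the set of data points assigned to the $ i $-th cluster at iteration $ t $. If a point is equally close to several centroids, the tie is broken arbitrarily so that each point belongs to exactly one cluster.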
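The assignment step can be sketched with NumPy broadcasting. The function name `assign_clusters` is an assumption for illustration. The sketch builds an intermediate array of shape `(n, K, d)`, which is convenient for small data but may need chunking for large datasets.

```python
import numpy as np

def assign_clusters(X, centroids):
    """Return, for each row of X, the index of its nearest centroid."""
    # diffs[p, i, :] = X[p] - centroids[i]
    diffs = X[:, None, :] - centroids[None, :, :]
    # dists[p, i] = squared Euclidean distance from point p to centroid i
    dists = np.sum(diffs ** 2, axis=2)
    # argmin breaks ties in favor of the lowest centroid index.
    return np.argmin(dists, axis=1)
```

Squared distances are enough here because taking the square root does not change which centroid is nearest.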

#### Step 3: Update the Centroids

2. **Update Step**: Calculate the new centroids as the mean of all points assigned to each cluster. The new centroid $ \mu_i^{(t+1)} $ for the $ i $-th cluster is computed as:
   $$ 
   \mu_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j 
   $$
   Here, $ |S_i^{(t)}| $ is the number of data points in the $ i $-th cluster at iteration $ t $.
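The update step can be sketched as below. The formula is undefined when a cluster receives no points, so this sketch keeps the previous centroid in that case; that fallback is an assumption of the example, and other strategies (such as re-seeding the empty cluster) are possible. The name `update_centroids` is hypothetical.

```python
import numpy as np

def update_centroids(X, labels, centroids):
    """Move each centroid to the mean of the points currently assigned to it."""
    new_centroids = np.array(centroids, dtype=float)  # copy of the old centroids
    for i in range(new_centroids.shape[0]):
        members = X[labels == i]
        if len(members) > 0:
            new_centroids[i] = members.mean(axis=0)
        # Empty cluster: the old centroid is kept (assumption of this sketch).
    return new_centroids
```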

#### Step 4: Convergence Check

The assignment and update steps are repeated until the centroids stabilize, meaning the assignments no longer change or the centroid movement falls below a chosen threshold. Neither step can increase the within-cluster sum of squared distances, so the procedure converges, but only to a local optimum that depends on the initial centroids.
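Putting the steps together, the loop with both stopping criteria might look like the following sketch. The function name `lloyd_iterations`, the default `tol=1e-4`, and the choice to measure movement as the norm of the centroid shift are assumptions for illustration.

```python
import numpy as np

def lloyd_iterations(X, centroids, max_iter=300, tol=1e-4):
    """Alternate assignment and update until labels or centroids stop changing."""
    centroids = np.array(centroids, dtype=float)
    labels = None
    n_iter = 0
    for it in range(max_iter):
        n_iter = it + 1
        # Assignment step: nearest centroid by squared Euclidean distance.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Update step: mean of each non-empty cluster (empty ones stay put).
        new_centroids = centroids.copy()
        for i in range(len(centroids)):
            members = X[new_labels == i]
            if len(members) > 0:
                new_centroids[i] = members.mean(axis=0)
        # Convergence check: unchanged assignments or a small centroid shift.
        shift = np.linalg.norm(new_centroids - centroids)
        unchanged = labels is not None and np.array_equal(new_labels, labels)
        centroids, labels = new_centroids, new_labels
        if unchanged or shift <= tol:
            break
    return centroids, labels, n_iter
```

The returned iteration count makes it easy to see whether the loop stopped on its own or ran into `max_iter`.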


## Pros and Cons

K-means is a widely used clustering algorithm due to its simplicity and efficiency. However, like any algorithm, it has its strengths and weaknesses. Here are some of the pros and cons of K-means:

### Pros of K-means:

1. **Simple and Easy to Implement**: K-means is straightforward to understand and implement, making it a popular choice for many clustering tasks.

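2. **Efficiency**: It is computationally efficient, especially for large datasets. Each iteration costs $ O(nKd) $, where $ n $ is the number of data points, $ K $ the number of clusters, and $ d $ the number of features, so the running time grows linearly with $ n $ for fixed $ K $ and $ d $.

3. **Scalability**: K-means scales to large datasets, and its cost grows only linearly with the number of features, although Euclidean distances tend to become less informative in very high-dimensional spaces.

4. **Well-suited for Spherical Clusters**: It works well when the clusters are compact, roughly spherical, and well separated.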

5. **Adaptability**: K-means can be easily adapted for a wide range of different domains and types of data.

6. **Good for Hard Clustering**: It provides a clear partitioning of the dataset, assigning each data point to a single cluster.

### Cons of K-means:

1. **Requirement of Specifying $ K $**: You need to specify the number of clusters ($ K $) in advance, which can be challenging without domain knowledge or additional methods like the Elbow method.

2. **Sensitivity to Initial Centroids**: The final results can vary based on the initial choice of centroids. K-means++ helps alleviate this issue but doesn't completely eliminate it.

3. **Poor Performance with Non-Spherical Clusters**: K-means assumes that clusters are spherical and of similar size, which might not always be the case. It performs poorly on data with complex geometric shapes or on clusters of varying sizes and densities.

4. **Local Optima**: K-means may converge to a local optimum depending on the initial centroid positions. This means it doesn't guarantee a globally optimal solution.

5. **Sensitive to Outliers**: Outliers can significantly skew the means of the clusters, leading to inaccurate clustering.

6. **Not Suitable for Categorical Data**: K-means is primarily designed for continuous numerical data and does not work well with categorical data.

7. **Lack of Hierarchical Structure**: K-means does not provide any hierarchical relationship among clusters.

8. **Feature Scaling Dependency**: The results of K-means depend heavily on the scale of the features. Because distances are Euclidean, features with larger numeric ranges dominate the clustering, so features are usually standardized before running the algorithm.
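The scaling issue in point 8 can be illustrated with a small, made-up dataset (the values below are hypothetical and chosen only for the demonstration). Before standardization the first feature dominates the Euclidean distance; afterwards both features contribute on a comparable scale, and the nearest neighbor of the first point changes.

```python
import numpy as np

# Hypothetical toy data: column 0 spans tens of units, column 1 lies in [0, 1].
X = np.array([
    [100.0, 0.00],
    [110.0, 1.00],
    [150.0, 0.05],
    [160.0, 0.95],
])

def standardize(X):
    """Rescale each column to zero mean and unit (population) variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unchanged
    return (X - mean) / std

def nearest_other(X, i):
    """Index of the row closest to row i, excluding row i itself."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)
    d2[i] = np.inf
    return int(np.argmin(d2))

nearest_raw = nearest_other(X, 0)
nearest_scaled = nearest_other(standardize(X), 0)
print("nearest to row 0 (raw):   ", nearest_raw)
print("nearest to row 0 (scaled):", nearest_scaled)
```

With this toy data the nearest neighbor of row 0 switches from row 1 (close in the large-range feature) to row 2 (close in the small-range feature) once the columns are standardized, which is the kind of change that can alter cluster assignments.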

In summary, while K-means is a powerful tool for certain clustering tasks, its effectiveness can be limited by the nature of the data and the specific requirements of the application. It's often used as a first-line approach, with more complex algorithms considered if K-means proves inadequate.

## Code

### Numpy

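```python
import warnings

import numpy as np
from datasets import load_dataset
from sklearn.cluster import KMeans

# Kmeans_numpy is a local module (not shown here) that provides kmeans_plusplus.
from Kmeans_numpy import kmeans_plusplus

warnings.filterwarnings('ignore')
```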

```python
# Load the training split and convert it to a pandas DataFrame.
dataset = load_dataset("imodels/diabetes-readmission", split='train')
df = dataset.to_pandas()
df.head(5)
```

| Unnamed: 0 | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses | change | diabetesMed | ... | glyburide-metformin:Up | A1Cresult:>7 | A1Cresult:>8 | A1Cresult:None | A1Cresult:Norm | max_glu_serum:>200 | max_glu_serum:>300 | max_glu_serum:None | max_glu_serum:Norm | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 38.0 | 3.0 | 27.0 | 0.0 | 1.0 | 2.0 | 7.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 1 | 4.0 | 48.0 | 0.0 | 11.0 | 0.0 | 0.0 | 0.0 | 9.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 2 | 2.0 | 28.0 | 0.0 | 15.0 | 0.0 | 3.0 | 4.0 | 9.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 |
| 3 | 4.0 | 44.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 7.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 4 | 3.0 | 54.0 | 0.0 | 8.0 | 0.0 | 0.0 | 0.0 | 8.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |


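```python
# All columns except the last are features; the last column (readmitted) is the label.
X = np.array(df.iloc[:, :-1])
y = np.array(df.iloc[:, -1])
print(X.shape)
print(y.shape)
```

```text
(81410, 150)
(81410,)
```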


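```python
# kmeans_plusplus comes from the local Kmeans_numpy module.
centroids, assignments = kmeans_plusplus(X, 2)

# Cluster IDs are arbitrary, so this agreement score is only meaningful up to a relabeling.
print("Accuracy:", np.sum(assignments == y)/len(y))
```

```text
Accuracy: 0.486795234000737
```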


### Sklearn

```python
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=300, random_state=42)

# Fit the model
kmeans.fit(X)

# Centroids
centroids = kmeans.cluster_centers_

# Cluster labels for each point
labels = kmeans.labels_

# Cluster IDs are arbitrary, so this agreement score is only meaningful up to a relabeling.
print("Accuracy:", np.sum(labels == y)/len(y))
```

```text
Accuracy: 0.48680751750399215
```
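Because K-means assigns cluster IDs arbitrarily, comparing them directly with `y` can understate the agreement: with two clusters and binary labels, swapping the IDs turns an accuracy of $ a $ into $ 1 - a $. One way to account for this, sketched below, is to try every one-to-one mapping from cluster IDs to class labels and keep the best score. The helper name `best_match_accuracy` is hypothetical, and the brute-force search assumes a small number of clusters that does not exceed the number of classes.

```python
from itertools import permutations

import numpy as np

def best_match_accuracy(labels, y):
    """Best agreement between cluster IDs and class labels over all relabelings."""
    labels = np.asarray(labels)
    y = np.asarray(y)
    cluster_ids = np.unique(labels)
    class_ids = np.unique(y)
    best = 0.0
    # Try every one-to-one mapping from cluster IDs to class IDs.
    for perm in permutations(class_ids, len(cluster_ids)):
        mapping = dict(zip(cluster_ids, perm))
        mapped = np.array([mapping[c] for c in labels])
        best = max(best, float(np.mean(mapped == y)))
    return best
```

For example, `best_match_accuracy(labels, y)` could be reported alongside the raw comparison above. It remains an external check against known labels, not a measure of clustering quality in itself.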
