## Introduction:
* K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
* The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.
* You’ll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of the cluster.
* Every data point is allocated to one of the clusters in a way that reduces the within-cluster sum of squares.
* In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest centroid, while keeping the clusters as compact as possible (that is, keeping the distances between points and their centroid small).
* The **means** in K-means refers to averaging the data, that is, finding the centroid.

![kmeans.png](attachment:kmeans.png)

## Algorithm:
**1)** First, select a value of K (the number of centroids you need in the dataset).

**2)** Say K=2, so there are two centroids. First, initialize these two centroids randomly.

Next, find which points are nearer to the first centroid and which are nearer to the second. Distance can be measured with **Euclidean** or **Manhattan distance**. In picture B below, two straight lines are drawn; points on the blue side (i.e. right of the line) are treated as blue points, and points on the red side are treated as red points.

**3)** Take each cluster and compute the mean (average) of its points. Move each centroid to the mean of its cluster. This gives two new centroids, one for the red cluster and one for the blue cluster.

**4)** Draw the straight line between the new centroids again and repeat steps 2 and 3. This time the distances are measured from the current centroids rather than from randomly selected points. As a result, a point that was in the red cluster may now fall into the blue cluster, or vice versa.

**5)** These steps repeat until no point moves from one cluster to another. At that point the two clusters are final and the model is ready.
![kmeans_algo.png](attachment:kmeans_algo.png)
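
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not a reference implementation: the function name `kmeans`, the parameters `max_iter` and `seed`, and the synthetic two-blob data are assumptions made for the example. It also assumes squared Euclidean distance and that the initial centroids are drawn from the data points themselves.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means with random initialization (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when no point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Synthetic data (assumed for the example): two blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

centroids, labels = kmeans(X, k=2)
print(centroids)            # final centroid coordinates
print(np.bincount(labels))  # number of points in each cluster
```

The loop body is step 4 in practice: each pass reassigns points using the current centroids and then recomputes the means.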


### How to select value of K?
* To select the value of K we can use the **Elbow method**.
* The Elbow method works as follows:
    * Run a loop from K = 1 to some number (let's say 20).
    * For each value of K, run the whole K-means process.
    * For each value of K (each iteration of the loop), calculate the WCSS (within-cluster sum of squares). Here Xi are the data points and Ci is the centroid of the cluster they belong to.
    * WCSS is the sum of the squared distances between each data point and the centroid of its cluster, added up over all clusters. The idea is to minimise this sum.
    * For example, with k=2 there are two centroids, and for each centroid we calculate the distance to every point in its cluster.  
    ![wcss.jpeg](attachment:wcss.jpeg)
    
    * As the value of K increases, the graph of WCSS against K looks like:
    
    ![elbow_method_graph.png](attachment:elbow_method_graph.png)
    
    * Select the last value of K at which the WCSS shows an abrupt (sudden) decrease, the "elbow" of the curve. In the graph above, that gives k=5.
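
A rough sketch of the elbow loop is shown below. It is only an illustration: the helper name `kmeans_wcss`, its parameters, the range of K (1 to 8 instead of 20, to keep it short), and the synthetic three-blob data are all assumptions made for the example. The helper runs plain K-means once with random initialization and returns the WCSS of the result.

```python
import numpy as np

def kmeans_wcss(X, k, max_iter=100, seed=0):
    """Run plain K-means once and return the final WCSS (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each non-empty cluster's centroid to the mean of its points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    # WCSS: squared distance of every point to the centroid of its own cluster.
    return float(((X - centroids[labels]) ** 2).sum())

# Synthetic data (assumed for the example): three blobs of 40 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.4, size=(40, 2)) for loc in (0.0, 4.0, 8.0)])

ks = list(range(1, 9))
wcss_values = [kmeans_wcss(X, k) for k in ks]
for k, w in zip(ks, wcss_values):
    print(k, round(w, 2))
```

Plotting `wcss_values` against `ks` gives the kind of curve shown in the graph above. Because each run starts from a random initialization, a single run per K can be noisy; repeating each K with several seeds and keeping the lowest WCSS is a common way to smooth the curve.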

## Drawback of standard K-means algorithm:

* One disadvantage of the K-means algorithm is that it is sensitive to the initialization of the centroids or the mean points.
* So, if a centroid is initialized to be a “far-off” point, it might just end up with no points associated with it, and at the same time, more than one cluster might end up linked with a single centroid.
* Similarly, more than one centroid might be initialized within the same cluster, resulting in poor clustering. For example, consider the images shown below. 

![poor_clustering.png](attachment:poor_clustering.png)

* Instead, the clustering should have looked like this:
![ideal_clustering.png](attachment:ideal_clustering.png)
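
The effect of initialization can be reproduced on a tiny hand-made dataset. The sketch below is an illustration under assumptions: the function name `kmeans_from`, the six data points, and both sets of starting centroids are invented for the example. The "bad" start places two centroids in the same group of points, so the remaining centroid ends up covering two groups.

```python
import numpy as np

def kmeans_from(X, init_centroids, max_iter=100):
    """Plain K-means started from given centroids; returns (labels, WCSS)."""
    centroids = np.array(init_centroids, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each non-empty cluster's centroid to the mean of its points.
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return labels, wcss

# Three obvious groups of two points each, at x = 0, x = 10 and x = 20.
X = np.array([[0, 0], [0, 1], [10, 0], [10, 1], [20, 0], [20, 1]], dtype=float)

# Bad start: two centroids inside the first group, one in the second.
bad_labels, bad_wcss = kmeans_from(X, [[0, 0], [0, 1], [10, 0]])
# Good start: one centroid in each group.
good_labels, good_wcss = kmeans_from(X, [[0, 0], [10, 0], [20, 0]])

print(bad_labels, bad_wcss)    # [0 1 2 2 2 2] 101.0
print(good_labels, good_wcss)  # [0 0 1 1 2 2] 1.5
```

Both runs converge, but the bad start is stuck with the first group split in two and the other two groups merged, which is the situation described in the bullets above.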


## K-means++:
* To overcome the above-mentioned drawback we use K-means++. This algorithm ensures a smarter initialization of the centroids and improves the quality of the clustering.
* Apart from initialization, the rest of the algorithm is the same as the standard K-means algorithm. That is, K-means++ is the standard K-means algorithm coupled with a smarter initialization of the centroids.

* The steps of the initialization algorithm are:
    1. Randomly select the first centroid from the data points.
    2. For each data point compute its distance from the nearest, previously chosen centroid.
    3. Choose the next centroid from the data points such that the probability of choosing a point is directly proportional to its squared distance from the nearest centroid already chosen (i.e. the point farthest from its nearest centroid is the most likely to be selected next).
    4. Repeat steps 2 and 3 until k centroids have been sampled.
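
The initialization steps above could look roughly like this in NumPy. It is a sketch under assumptions: the function name `kmeans_pp_init`, the `seed` parameter, and the synthetic three-blob data are made up for the example, and it assumes the data contains at least k distinct points (otherwise the probabilities in step 3 cannot be normalized).

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++ style initialization (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: choose the first centroid uniformly at random from the data.
    centroids = [X[rng.integers(n)]]
    # Step 2: squared distance from each point to its nearest chosen centroid.
    d2 = ((X - centroids[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        # Step 3: sample the next centroid with probability proportional to d2.
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        centroids.append(X[idx])
        # Step 2 again: refresh the nearest-centroid distances.
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centroids, dtype=float)

# Synthetic data (assumed for the example): three blobs of 30 points each.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in (0.0, 5.0, 10.0)])

init = kmeans_pp_init(X, k=3)
print(init)  # three data points to use as starting centroids
```

A point that has already been chosen has squared distance 0 and therefore probability 0, so it cannot be picked twice. The returned array would replace the random starting centroids in standard K-means; everything after initialization stays the same.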


* Although the initialization in K-means++ is computationally more expensive than the random initialization of standard K-means, the run time needed to converge is usually reduced. This is because the centroids that are initially chosen are likely to lie in different clusters already.