# Overview of Clustering and Essential Concepts

**Clustering** is a type of unsupervised learning technique whose objective is to draw conclusions from patterns found in unlabelled data. It is mainly used to segregate large datasets into subgroups so that informed decisions can be made. Clustering algorithms divide data into $n$ different clusters in such a way that the data points within each cluster have similar features, whereas they differ significantly from the data points in other clusters.

# Clustering types

Clustering algorithms can assign data points to clusters using either a hard or a soft methodology. The former assigns each data point completely to one cluster, whereas the latter calculates the probability of the data point belonging to each of the clusters. Because clusters are created based on similarity between data points, the algorithms can be divided into several groups depending on the set of rules used for calculating similarity:


+ **Connectivity based models:** These models base similarity on proximity in the data space. Clusters can be created by assigning all data points to a single cluster and then partitioning the data into smaller clusters as the distance between data points increases. Alternatively, the algorithm can start by assigning each data point to its own cluster and then merge data points that are close to each other. An example of a connectivity-based model is hierarchical clustering.

+ **Density based models:** As the name suggests, these models define clusters by their density in the data space. This means that areas with a high density of data points will become clusters, which are typically separated from one another by low-density areas. An example of this is the DBSCAN algorithm.

+ **Distribution based models:** Models that fall into this category are based on the probability that all the data points from a cluster follow the same distribution, such as a Gaussian distribution. An example of such a model is the Gaussian Mixture algorithm, which assumes that all data points come from a mixture of a finite number of Gaussian distributions.

+ **Centroid based models:** These models are based on algorithms that define a centroid for each cluster, which is updated repeatedly by an iterative process. Each data point is assigned to the cluster whose centroid is closest to it. An example of such a model is the k-means algorithm.
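To make the centroid-based idea concrete, here is a minimal sketch of the assign-then-update loop behind k-means. It is an illustration under simplifying assumptions, not a production implementation: the function name `kmeans`, the choice of the first `k` points as starting centroids, and the toy data are all hypothetical.

```python
import math

def kmeans(points, k, iterations=10):
    """Minimal k-means sketch: assign points to the nearest centroid, then update."""
    # Assumption: the first k points serve as the initial centroids.
    # Real implementations usually pick them randomly or with k-means++.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy data: two well-separated groups, interleaved.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)     # cluster index for each point
print(centroids)  # final centroid coordinates
```

The result of k-means depends on the starting centroids, which is why practical implementations run several initialisations and keep the best one.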

There are a few important points to consider while working with clustering algorithms:

+ Distance Metrics
+ Feature Scaling, and
+ Number of dimensions

# Distance metrics

![](data/distance.png)


In clustering algorithms we often use distance to measure the similarity between two data points. Different types of data call for different distance metrics. The most commonly used are Euclidean distance for numerical data and Hamming distance for categorical data. There are many distance metrics, so I'll limit myself to a few important ones here. You can find more about distance metrics [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html).

## Euclidean distance

![](data/euc.png)


Euclidean distance is the metric most of us are familiar with. It can be 0 or any positive real number. It is the square root of the sum of squared differences between the corresponding coordinates of two points $x$ and $y$. For higher dimensions, we simply add one squared-difference term per extra dimension.

$$ d_n(x,y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + (x_3-y_3)^2 + \cdots + (x_n-y_n)^2} $$
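As a small sketch of the formula above, the function below computes the Euclidean distance with the Python standard library only. The function name `euclidean_distance` is my own choice for illustration.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("both points need the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean_distance((0, 0), (3, 4)))        # 5.0 (the 3-4-5 triangle)
print(euclidean_distance((1, 2, 3), (1, 2, 3)))  # 0.0 (identical points)
```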

## Manhattan distance

![](data/man.png)

Manhattan distance is the sum of the absolute differences between the Cartesian coordinates of two points. Whereas Euclidean distance is the straight-line, 'as the crow flies' distance given by Pythagoras' theorem, Manhattan distance adds up the distances travelled along each axis, like a taxi riding among blocks of buildings.

$$d_n(x,y) = \sum_{i=1}^{n}{|(x_i-y_i)|}$$
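The same pair of points used for the Pythagorean example gives a different answer under the Manhattan metric. This is a minimal sketch; `manhattan_distance` is an illustrative name.

```python
def manhattan_distance(x, y):
    """Sum of the absolute coordinate differences."""
    if len(x) != len(y):
        raise ValueError("both points need the same number of dimensions")
    return sum(abs(a - b) for a, b in zip(x, y))

# Three blocks east and four blocks north: the taxi drives 7 blocks,
# while the straight-line (Euclidean) distance would be 5.
print(manhattan_distance((0, 0), (3, 4)))  # 7
```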


## Minkowski Distance

The distance can be calculated using the below formula:

$$d_n(x,y) = {\biggr(\sum_{i=1}^{n}|(x_i-y_i)|^p\biggr)}^{1/p}$$

Minkowski distance is a generalized distance metric. By substituting different values of $p$ in the above formula, we can calculate the distance between two data points in different ways. For this reason, Minkowski distance is also known as the $L_p$ norm distance.
Some common values of $p$ are:
+ p = 1, Manhattan Distance
+ p = 2, Euclidean Distance
+ p = infinity, Chebyshev Distance
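The three special cases can be checked with one small function. This is a sketch under the assumption that the infinite case is handled as the limit of the formula, which is the largest absolute coordinate difference; `minkowski_distance` is an illustrative name.

```python
import math

def minkowski_distance(x, y, p):
    """Minkowski (L_p) distance; pass math.inf for the Chebyshev limit."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # As p grows, the largest difference dominates the sum.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)

x, y = (0, 0), (3, 4)
print(minkowski_distance(x, y, 1))         # 7.0 (Manhattan)
print(minkowski_distance(x, y, 2))         # 5.0 (Euclidean)
print(minkowski_distance(x, y, math.inf))  # 4   (Chebyshev)
```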


## Hamming distance

![](data/hamm.png)

Hamming distance is used to calculate the distance between categorical variables (sometimes called nominal variables). It is the number of positions at which two equal-length sequences, such as binary strings, differ. Categorical values have no inherent order, so instead of measuring how far apart two values are, we can only record whether they match or not. When a dataset contains only categorical features, we can use Hamming distance to measure the similarity between two data points. If the two vectors are of unequal length, they cannot be compared.
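A short sketch of this counting rule follows. It works for binary strings as well as for lists of category labels; the function name and the example records are hypothetical.

```python
def hamming_distance(x, y):
    """Count the positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(x, y))

print(hamming_distance("10110", "10011"))  # 2

# Two records described by categorical features (colour, size, shape).
print(hamming_distance(["red", "small", "round"], ["red", "large", "round"]))  # 1
```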



## Gower distance

Gower (1971) distance is a hybrid measure that handles both continuous and categorical data.

If a feature is continuous or ordinal, the Manhattan distance or a ranked ordinal Manhattan distance is applied, respectively (typically scaled by the feature's range so that each score lies between 0 and 1).
If a feature is categorical, a DICE coefficient is applied. If you are familiar with the Jaccard coefficient or with binary classification (e.g. True Positives TP, False Positives FP, etc.) and confusion matrices, then DICE is going to be familiar as


$$DICE = \frac{2|X\cap Y|}{|X|+|Y|} = \frac{2TP}{2TP + FP + FN}$$
 
 
The Gower distance of a pair of points $G_n(p,q)$ then is:

$$G_n(p,q) = \frac{\sum_{k=1}^{n}W_{pqk}S_{pqk}}{\sum_{k=1}^{n}W_{pqk}}$$

where $S_{pqk}$ is either the Manhattan or DICE value for feature $k$, and $W_{pqk}$ is 1 if feature $k$ is valid for the comparison (for example, not missing in either point) and 0 otherwise. It is the sum of the feature scores divided by the sum of the feature weights.
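The weighted average above can be sketched as follows. Several simplifications here are my own assumptions rather than part of the definition: numeric scores are absolute differences divided by a range supplied by the caller, categorical scores use a plain match or mismatch (0 or 1) instead of a full DICE computation, and a feature gets weight 0 only when one of the two values is `None`. The record layout is hypothetical.

```python
def gower_distance(p, q, ranges):
    """Gower-style distance between two mixed-type records p and q.

    ranges[k] is the value range of numeric feature k,
    or None when feature k is categorical.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for a, b, value_range in zip(p, q, ranges):
        if a is None or b is None:
            continue  # weight 0: this feature cannot be compared
        if value_range is None:
            score = 0.0 if a == b else 1.0    # categorical: match or mismatch
        else:
            score = abs(a - b) / value_range  # numeric: range-scaled Manhattan
        weighted_sum += score
        weight_total += 1.0
    if weight_total == 0:
        raise ValueError("no comparable features")
    return weighted_sum / weight_total

# Hypothetical records: (age, income, city, owns_car)
p = (30, 50000, "Paris", None)
q = (40, 70000, "Berlin", "yes")
ranges = (50, 100000, None, None)
print(gower_distance(p, q, ranges))  # (0.2 + 0.2 + 1.0) / 3
```

Because every per-feature score lies between 0 and 1, the overall distance does too, which lets numeric and categorical features contribute on an equal footing.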

## Cosine similarity

Cosine similarity is often used as a way to counteract Euclidean distance's problems with high dimensionality. It is simply the cosine of the angle between two vectors, which equals the inner product of the vectors after both have been normalized to length one.
Two vectors with exactly the same orientation have a cosine similarity of 1, whereas two vectors diametrically opposed to each other have a similarity of -1. Because it is purely a measure of orientation, the magnitude of the vectors is not taken into account, which is a disadvantage whenever magnitude carries meaning.

$$d_{(x,y)} = \cos(\theta) = \frac{x \cdot y}{\|x\|\ \|y\|}$$

We use cosine similarity often when we have high-dimensional data and when the magnitude of the vectors is not of importance. For text analyses, this measure is quite frequently used when the data is represented by word counts.
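The word-count use case can be illustrated with a small sketch. The three "documents" below are hypothetical count vectors over the same four-word vocabulary, and `cosine_similarity` is an illustrative name.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

doc_a = [2, 1, 0, 3]
doc_b = [4, 2, 0, 6]  # same word proportions as doc_a, twice as long
doc_c = [0, 0, 5, 0]  # shares no words with doc_a

print(cosine_similarity(doc_a, doc_b))  # very close to 1: same orientation
print(cosine_similarity(doc_a, doc_c))  # 0.0: orthogonal vectors
```

Note how `doc_b` is twice as long as `doc_a` yet counts as (almost exactly) identical, which is the magnitude-blindness described above.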


## Haversine distance

![](data/hav.png)

Haversine distance is the distance between two points on a sphere given their longitudes and latitudes. It is very similar to Euclidean distance in that it calculates the shortest line between two points. The main difference is that no straight line is possible since the assumption here is that the two points are on a sphere.

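The Haversine distance between two points with latitudes $\phi_1, \phi_2$ and longitudes $\lambda_1, \lambda_2$ (in radians) on a sphere of radius $r$ is: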

$$ d = 2\ r\ \sin^{-1}\biggr(\sqrt{\sin^2\big(\frac{\phi_2-\phi_1}{2}\big) + \cos(\phi_1)\cos(\phi_2)\sin^2\big(\frac{\lambda_2-\lambda_1}{2}\big)}\biggr)$$

One disadvantage of this distance measure is the assumption that the points lie on a sphere. In practice, this is seldom exactly the case: the Earth, for example, is not perfectly round, which can make the calculated distances inaccurate in certain cases.
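The Haversine formula translates almost directly into code. In this sketch the inputs are in degrees and converted to radians internally; the mean Earth radius of 6371 km and the city coordinates are approximate values I am assuming for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; the Earth is not a perfect sphere

def haversine_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))

# Approximate coordinates of Paris and London.
print(haversine_distance(48.8566, 2.3522, 51.5074, -0.1278))

# A quarter of the way around the equator equals radius * pi / 2.
print(haversine_distance(0, 0, 0, 90))
```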

## Jaccard distance

![](data/jac.png)

The Jaccard index (or Intersection over Union) is a metric used to calculate the similarity and diversity of sample sets. It is the size of the intersection divided by the size of the union of the sample sets.
In practice, it is the total number of similar entities between sets divided by the total number of entities. For example, if two sets have 1 entity in common and there are 5 different entities in total, then the Jaccard index would be 1/5 = 0.2.
To calculate the Jaccard distance we simply subtract the Jaccard index from 1:

$$D_{(x,y)} = 1 - \frac{|x\cap y|}{|x\cup y|}$$

A major disadvantage of the Jaccard index is that it is highly influenced by the size of the data. Large datasets can have a big impact on the index, as they can significantly increase the union whilst keeping the intersection similar. The Jaccard index is often used in applications where binary or binarized data are used. When you have a deep learning model predicting segments of an image, for instance a car, the Jaccard index can be used to calculate how accurate the predicted segment is given the true labels. Similarly, it can be used in text similarity analysis to measure how much word choice overlaps between documents. Thus, it can be used to compare sets of patterns.
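The worked example from above (1 shared entity out of 5 distinct entities) can be reproduced with Python sets. The function names and the fruit sets are illustrative.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        raise ValueError("Jaccard index is undefined for two empty sets")
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    """One minus the Jaccard index."""
    return 1 - jaccard_index(a, b)

x = {"apple", "banana", "cherry"}
y = {"cherry", "date", "elderberry"}
print(jaccard_index(x, y))     # 0.2 (1 shared item, 5 distinct items)
print(jaccard_distance(x, y))  # 0.8
```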

# Feature Scaling

It is always advisable to bring all the features to the same scale for applying distance based algorithms like KNN.
Let's see an example of distance calculation using two features whose magnitudes/ranges vary greatly.  

$$ Euclidean\ distance\ = \sqrt{(820000 - 325000)^2 + (3.75 - 0.50)^2} $$



From the above equation we can see that the contribution from the feature with the larger magnitude overpowers that of the feature with the smaller magnitude. So it is advisable to normalize the data to the range 0-1 for better results.

One can use sklearn's `MinMaxScaler` to do so.

$$ X_{scaled} = \frac{X - X_{min} }{X_{max} -X_{min}}$$


**Note:** We are not using standardization here because we don't assume the data comes from any particular distribution.
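To see the effect of scaling on the example above, the sketch below applies the min-max formula by hand to two feature columns and compares the distance before and after. The third data row and the column names are hypothetical additions; sklearn's `MinMaxScaler` applies the same formula column by column with its default settings.

```python
import math

def min_max_scale(column):
    """Rescale a list of numbers to the range 0-1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in column]

# Two features with very different magnitudes.
prices = [325000, 820000, 500000]
ratings = [0.50, 3.75, 2.00]

raw_dist = math.dist((prices[0], ratings[0]), (prices[1], ratings[1]))

scaled_prices = min_max_scale(prices)
scaled_ratings = min_max_scale(ratings)
scaled_dist = math.dist(
    (scaled_prices[0], scaled_ratings[0]),
    (scaled_prices[1], scaled_ratings[1]),
)

print(raw_dist)     # dominated almost entirely by the price difference
print(scaled_dist)  # both features now contribute equally
```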

# Curse of dimensionality

The curse of dimensionality refers to the problems that arise when your data has too many features. When we have too many features, observations become harder to cluster. Too many dimensions cause every observation in the dataset to appear equidistant from all the others. And because clustering uses a distance measure such as Euclidean distance to quantify the similarity between observations, this is a big problem. If the distances are all approximately equal, then all the observations appear equally alike (as well as equally different), and no meaningful clusters can be formed.
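This effect can be observed in a small experiment. The sketch below draws uniformly random points (an assumption made purely for illustration) and measures how much farther the farthest point is than the nearest one, relative to the nearest distance. The function name `relative_contrast`, the sample size, and the seed are hypothetical choices.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

# The contrast shrinks as the number of dimensions grows.
for dims in (2, 10, 100, 1000):
    print(dims, round(relative_contrast(200, dims), 3))
```

In low dimensions the farthest point is many times farther away than the nearest one, while in high dimensions the two distances become nearly the same, which is exactly what makes distance-based clustering hard.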



# Applications of Clustering

Clustering has a large number of applications spread across various domains. Some of the most popular applications of clustering are:

+ Recommendation engines
+ Market segmentation
+ Social network analysis
+ Search result grouping
+ Medical imaging
+ Image segmentation
+ Anomaly detection