# Cluster Analysis

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import warnings 

warnings.filterwarnings('ignore')

A **cluster** is a group of items with similar characteristics, such as customers with similar habits or news articles about the same topic.

Clustering algorithms: 
- Hierarchical clustering
- K-means clustering
- DBSCAN, Gaussian methods, and others

## Hierarchical clustering 

Steps of **hierarchical clustering**:
- Initially, each observation is its own cluster.
- We calculate the distances between every pair of clusters.
- The two closest clusters are merged into a single cluster.
- This process repeats until we reach the desired number of clusters.
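
The steps above can be sketched in plain NumPy. This is only an illustration, not how `scipy` implements it: the function name `naive_agglomerative` and the toy points are hypothetical, and it measures closeness between cluster centroids, whereas the notebook's own example uses the `'ward'` criterion.

```python
import numpy as np

def naive_agglomerative(points, n_clusters):
    """Merge the two clusters with the closest centroids until n_clusters remain."""
    points = np.asarray(points, dtype=float)
    # Start with one cluster per observation (each cluster is a list of row indices).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        centroids = np.array([points[c].mean(axis=0) for c in clusters])
        # Pairwise Euclidean distances between cluster centroids.
        diff = centroids[:, None, :] - centroids[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, np.inf)  # a cluster is never merged with itself
        i, j = np.unravel_index(np.argmin(dist), dist.shape)
        i, j = int(min(i, j)), int(max(i, j))
        # Merge the closest pair into one cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # Turn the lists of indices into one label per observation.
    labels = np.empty(len(points), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Hypothetical toy data: two tight pairs and one isolated point.
toy_points = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0]]
toy_labels = naive_agglomerative(toy_points, 3)
print(toy_labels)
```

Recomputing every distance on each pass is slow for large datasets; `linkage` is the practical choice.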

In [None]:
from scipy.cluster.hierarchy import linkage, fcluster

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4,
                10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.01]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4,
                47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]
df = pd.DataFrame({'x_coordinate': x_coordinates,
                    'y_coordinate': y_coordinates})

In [None]:
sns.scatterplot(data=df, x='x_coordinate', y='y_coordinate')

In [None]:
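# Build the merge hierarchy from the two coordinate columns using Ward's criterion
Z = linkage(df[['x_coordinate', 'y_coordinate']], 'ward')
# Cut the hierarchy so that at most 3 clusters remain
df['cluster_labels'] = fcluster(Z, 3, criterion='maxclust')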

In [None]:
sns.scatterplot(data=df, x='x_coordinate', y='y_coordinate', hue='cluster_labels')

## K-means clustering

Steps:
- Cluster centroids are randomly initialized.
- The distance from each datapoint to each cluster centroid is calculated.
- Each datapoint is assigned to its closest centroid.
- Cluster centroids are recalculated from the datapoints assigned to them.
- The assignment and update steps are repeated for a predefined number of iterations.
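
As a rough sketch of these steps, the loop below implements them directly in NumPy. The function name `simple_kmeans`, its parameters, and the toy points are assumptions made for illustration; `scipy`'s `kmeans` differs in its details (for example, in how it decides when to stop).

```python
import numpy as np

def simple_kmeans(points, k, n_iter=10, seed=0):
    """Minimal k-means loop: assign points to centroids, then update centroids."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct observations at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        # Assign each point to its closest centroid.
        labels = dist.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = points[labels == c].mean(axis=0)
    # Final assignment against the updated centroids.
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
    return centroids, dist.argmin(axis=1)

# Hypothetical toy data: two well-separated groups of three points.
toy_points = [[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
toy_centroids, toy_labels = simple_kmeans(toy_points, k=2)
print(toy_centroids)
print(toy_labels)
```

Because the starting centroids are random, which group receives label 0 depends on the seed; only the grouping itself is meaningful.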




In [None]:
from scipy.cluster.vq import kmeans, vq

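# Cluster on the coordinates only: df also holds the hierarchical labels by now
coords = df[['x_coordinate', 'y_coordinate']].to_numpy()
centroids, _ = kmeans(coords, 3)
df['cluster_labels_kmeans'], _ = vq(coords, centroids)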

In [None]:
sns.set_palette("inferno")


sns.scatterplot(data=df, x='x_coordinate', y='y_coordinate', hue='cluster_labels_kmeans')

# Data preparation for cluster analysis

Why prepare the data before clustering?
- Variables are often measured in incomparable units.
- Variables may have very different scales and variances.
- Clustering raw data can therefore give biased results.
- Clusters may end up depending heavily on a single variable.

Solution: normalize each variable individually.
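
To illustrate the problem, the sketch below compares Euclidean distances before and after rescaling. The observations (annual income in dollars and age in years) are made up for this example.

```python
import numpy as np

# Hypothetical observations: [annual income in dollars, age in years]
obs = np.array([[30000.0, 25.0],
                [30500.0, 60.0],
                [45000.0, 26.0]])

# Raw distances from the first observation to the other two.
raw_01 = np.linalg.norm(obs[0] - obs[1])  # similar income, very different age
raw_02 = np.linalg.norm(obs[0] - obs[2])  # similar age, very different income

# Rescale each column to a standard deviation of 1, then measure again.
scaled = obs / obs.std(axis=0)
scaled_01 = np.linalg.norm(scaled[0] - scaled[1])
scaled_02 = np.linalg.norm(scaled[0] - scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

On the raw data the income column dominates the distance, so the 35-year age gap barely registers. After rescaling, both variables contribute on a comparable footing.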

## Normalization of data

Here, normalization means rescaling each variable so that its standard deviation is 1, which is done by dividing the variable by its standard deviation:

$$x_{\text{new}} = \frac{x}{\operatorname{std}(x)}$$
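
As a quick check of the formula, the division can be done by hand in NumPy. The variable names here are illustrative, and the sketch assumes the population standard deviation (NumPy's default, `ddof=0`).

```python
import numpy as np

values = np.array([5, 1, 3, 7, 3, 5, 8, 2], dtype=float)

# Apply the formula directly: divide every value by the standard deviation.
manual_scaled = values / np.std(values)

# The rescaled data should have a standard deviation of 1.
print(manual_scaled)
print(np.std(manual_scaled))
```

Note that only the spread changes: the mean is not shifted to zero, so the rescaled values keep their sign and relative order.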


In [None]:
from scipy.cluster.vq import whiten

data = [5,1,3,7,3,5,8,2]

scaled_data = whiten(data)
print(scaled_data)

In [None]:
plt.plot(data, label='original', c='red')
plt.plot(scaled_data, label='scaled')
plt.legend()
plt.show()