# Analysis and Visualization of Complex Agro-Environmental Data
---
## Cluster Analysis - Agglomerative hierarchical clustering

Hierarchical agglomerative cluster analysis is one of the most widely used clustering approaches to group objects based on their dissimilarities. It works bottom-up: each object starts as its own cluster, and the two closest objects or previously defined clusters are merged successively until all objects belong to a single cluster. The result is a tree of clusters called a dendrogram, which graphically represents the hierarchical relationships among the clusters.
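
The successive merging can be illustrated with a deliberately naive sketch. It assumes Euclidean distance and single linkage (distance between the closest members of two clusters); the helper `naive_single_linkage()` and the toy coordinates are hypothetical and only meant to show the idea, not to replace the library functions used below.

```python
import numpy as np

def naive_single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (illustrative helper)."""
    points = np.asarray(points, dtype=float)
    # pairwise Euclidean distances between all objects
    dist = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2))
    clusters = [[i] for i in range(len(points))]  # every object starts alone
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]
    return clusters

# toy data (made up for illustration): two tight pairs and one isolated object
toy = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
result = naive_single_linkage(toy, n_clusters=3)
print(result)
```

Each pass of the `while` loop corresponds to one merge, that is, one horizontal bar in a dendrogram.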

In Python, the method is implemented in the `linkage()` and `dendrogram()` functions of the `scipy.cluster.hierarchy` module, and in the `AgglomerativeClustering()` class of the `sklearn.cluster` module.

To run the analysis, you first need to import the necessary modules and functions:

In [None]:
import numpy as np # for numerical arrays
import pandas as pd # to handle data frames
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for plotting
from scipy import stats # to compute statistics
from scipy.cluster.hierarchy import dendrogram, linkage # to run the linkage method and create dendrograms
from sklearn.cluster import AgglomerativeClustering # to perform agglomerative clustering

Create a DataFrame with two variables, `X` and `Y`.

In [None]:
df=pd.DataFrame({"X": [12,15,18,10,8,9,12,20,29,16,24,9,27], "Y": [6,16,17,8,7,6,9,18,8,14,6,7,9]})
df

Plot the data. Three groups of objects are visually apparent, so we expect the cluster analysis to recover them.

In [None]:
plt.scatter("X","Y", data=df)
plt.show()
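
The clustering itself is computed from the pairwise dissimilarities between objects, not from the scatter plot. As a minimal sketch, assuming Euclidean distance (the default metric of `linkage()`), the dissimilarity matrix of the example data can be inspected with `pdist()` and `squareform()` from `scipy.spatial.distance`, two functions that are not otherwise used in this notebook:

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df = pd.DataFrame({"X": [12, 15, 18, 10, 8, 9, 12, 20, 29, 16, 24, 9, 27],
                   "Y": [6, 16, 17, 8, 7, 6, 9, 18, 8, 14, 6, 7, 9]})

# condensed vector with one distance per pair of objects
condensed = pdist(df.to_numpy(dtype=float), metric='euclidean')

# square, symmetric matrix that is easier to read
dist_matrix = pd.DataFrame(squareform(condensed), index=df.index, columns=df.index)
print(dist_matrix.round(1))
```

Small values identify objects that are likely to be merged early in the dendrogram.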

Create dendrograms using `scipy`. Check the help files of the `linkage()` and `dendrogram()` functions.

In [None]:
help(linkage)

In [None]:
help(dendrogram)

Compute a dendrogram with single linkage

In [None]:
dendrogram_plot = dendrogram(linkage(df, method='single'))
plt.title('single linkage' )
plt.xlabel('objects')
plt.ylabel('Distance')
plt.show()
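
The array returned by `linkage()` encodes the whole tree: each row describes one merge. The sketch below wraps it in a DataFrame; the column names are chosen here for readability and are not part of SciPy.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage

df = pd.DataFrame({"X": [12, 15, 18, 10, 8, 9, 12, 20, 29, 16, 24, 9, 27],
                   "Y": [6, 16, 17, 8, 7, 6, 9, 18, 8, 14, 6, 7, 9]})

Z = linkage(df.to_numpy(dtype=float), method='single')

# columns 0 and 1: indices of the two merged clusters (an index >= 13 refers to
# a cluster created by an earlier row); column 2: merge distance;
# column 3: number of original objects in the new cluster
merges = pd.DataFrame(Z, columns=['cluster_a', 'cluster_b', 'distance', 'n_objects'])
merges = merges.astype({'cluster_a': int, 'cluster_b': int, 'n_objects': int})
print(merges)
```

With 13 objects there are 12 merges, and the last row always contains all objects.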

Compute a dendrogram with complete linkage

In [None]:
dendrogram_plot = dendrogram(linkage(df, method='complete'))
plt.title('complete linkage' )
plt.xlabel('objects')
plt.ylabel('Distance')
plt.show()

Compute a dendrogram with average linkage

In [None]:
dendrogram_plot = dendrogram(linkage(df, method='average'))
plt.title('average linkage' )
plt.xlabel('objects')
plt.ylabel('Distance')
plt.show()

Compute a dendrogram with centroid linkage

In [None]:
dendrogram_plot = dendrogram(linkage(df, method='centroid'))
plt.title('centroid linkage' )
plt.xlabel('objects')
plt.ylabel('Distance')
plt.show()

Compute a dendrogram with Ward's linkage

In [None]:
dendrogram_plot = dendrogram(linkage(df, method='ward'))
plt.title('Ward linkage' )
plt.xlabel('objects')
plt.ylabel('Distance')
plt.show()
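
One possible way to compare the five dendrograms numerically, which is not part of the original analysis, is the cophenetic correlation coefficient: the correlation between the original pairwise distances and the heights at which pairs of objects are joined in the tree. The sketch below assumes Euclidean distance and uses `cophenet()` from `scipy.cluster.hierarchy`; values closer to 1 suggest that the dendrogram preserves the original distances better.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

df = pd.DataFrame({"X": [12, 15, 18, 10, 8, 9, 12, 20, 29, 16, 24, 9, 27],
                   "Y": [6, 16, 17, 8, 7, 6, 9, 18, 8, 14, 6, 7, 9]})
X = df.to_numpy(dtype=float)

original_distances = pdist(X)  # condensed Euclidean distance matrix

coph_corr = {}
for method in ['single', 'complete', 'average', 'centroid', 'ward']:
    Z = linkage(X, method=method)
    c, _ = cophenet(Z, original_distances)  # c is the cophenetic correlation
    coph_corr[method] = c
    print(f"{method:>8}: {c:.3f}")
```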

Compute a dendrogram with average linkage and other options

In [None]:
# run linkage
link_avg = linkage(df,
                   metric='cityblock',  # cityblock (Manhattan) distance for the dissimilarity matrix
                   method='average')  # you may compare with other methods, except 'centroid', 'median' and 'ward', which only run with Euclidean distances

# run dendrogram
plt.figure(figsize=(10, 8))
dendrogram_plot = dendrogram(link_avg,
                            truncate_mode='lastp',  # show only the last p merged clusters - important when there are too many objects
                            p=10,  # number of merged clusters to show
                            leaf_font_size=12.,
                            show_contracted=True,  # to get a distribution impression in truncated branches
                            orientation='right')  # horizontal dendrogram with the root on the right
plt.title('Average linkage (cityblock distance)')
plt.xlabel('Distance')
plt.ylabel('Objects')

# set the number and composition of clusters by considering a maximum distance of 8, marked with a vertical line (x=8)
plt.axvline(x=8, color='r', linestyle='--')
plt.show()
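
The cut marked by the red line can also be applied programmatically. As a sketch, `fcluster()` from `scipy.cluster.hierarchy` with `criterion='distance'` returns, for each object, the label of the cluster it belongs to when the tree is cut at a given height (here 8, the same value as the line). The variable names are illustrative.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

df = pd.DataFrame({"X": [12, 15, 18, 10, 8, 9, 12, 20, 29, 16, 24, 9, 27],
                   "Y": [6, 16, 17, 8, 7, 6, 9, 18, 8, 14, 6, 7, 9]})

Z = linkage(df.to_numpy(dtype=float), method='average', metric='cityblock')

# cut the tree at distance 8: objects joined below that height share a label
labels_at_8 = fcluster(Z, t=8, criterion='distance')

print(labels_at_8)                  # labels start at 1, not 0
print(len(set(labels_at_8)), "clusters")
```

Changing `t` moves the cut up or down the tree and therefore changes the number of clusters.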

Visualize the clusters in a scatter plot. We now use the `AgglomerativeClustering()` class of `sklearn.cluster`, which generates a label for each object (row in the DataFrame `df`) given the number of clusters we are interested in.


In [None]:
# run cluster analysis and define 3 clusters (equivalent to the clusters defined by the vertical line in the previous dendrogram)
cluster3 = AgglomerativeClustering(n_clusters=3, # We are interested in only 3 clusters
                                    metric='manhattan', # equivalent to 'cityblock'; the argument is named `affinity` in scikit-learn < 1.2
                                    linkage='average')
group3_labels = cluster3.fit_predict(df) # fit the model and return the cluster label of each object
group3_labels # cluster label of each object (each row in df)

In [None]:
plt.scatter(df['X'], df['Y'], c=group3_labels)
plt.title('Average linkage' )
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

In [None]:
# The same but now considering 6 clusters
cluster6 = AgglomerativeClustering(n_clusters=6, 
                                    metric='manhattan', # equivalent to 'cityblock'
                                    linkage='average')
cluster6.fit_predict(df)
group6_labels = cluster6.labels_
plt.scatter(df['X'], df['Y'], c=group6_labels)
plt.title('Average linkage' )
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

#### Plot a heatmap with dendrogram by clustering rows using average linkage
The `seaborn` module has an interesting visualization tool that helps to visualize variable patterns and dissimilarities among objects in large datasets. It plots a heatmap combined with a dendrogram computed with a user-defined linkage method, to group either objects (rows) or variables (columns).

In [None]:
sns.clustermap(df, col_cluster=False, row_cluster=True, method='average')
plt.show()
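
The object returned by `clustermap()` keeps the dendrogram it computed, which makes it possible to recover the order in which the rows were drawn. This is a small sketch of that idea; `row_order` and `df_reordered` are names introduced here for illustration.

```python
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"X": [12, 15, 18, 10, 8, 9, 12, 20, 29, 16, 24, 9, 27],
                   "Y": [6, 16, 17, 8, 7, 6, 9, 18, 8, 14, 6, 7, 9]})

g = sns.clustermap(df, col_cluster=False, row_cluster=True, method='average')

# positions of the original rows, listed in the order shown in the heatmap
row_order = g.dendrogram_row.reordered_ind
df_reordered = df.iloc[row_order]
print(df_reordered)
```

Outside a notebook, call `plt.show()` from `matplotlib.pyplot` to display the figure.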

## Divisive Cluster Analysis (DIANA)

Another hierarchical clustering approach is divisive hierarchical cluster analysis (DIANA), which works top-down: it starts with all objects in a single cluster and splits clusters successively. It is less commonly used and, to our knowledge, is not implemented in any Python module (but see, e.g., https://github.com/div338/Divisive-Clustering-Analysis-Program-DIANA-/blob/master/divisive_clustering.py).
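
To give an idea of how a divisive procedure works, here is a simplified, DIANA-style sketch. It is not taken from the linked repository or from any library: the function `diana_sketch()` and the toy data are hypothetical, and Euclidean distance is assumed. At each step the cluster with the largest diameter is split by seeding a "splinter" group with its most dissimilar object and then moving over every object that is, on average, closer to the splinter group than to the rest.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def diana_sketch(X, n_clusters, metric='euclidean'):
    """Simplified DIANA-style divisive clustering (illustrative helper)."""
    D = squareform(pdist(np.asarray(X, dtype=float), metric=metric))
    clusters = [list(range(len(D)))]  # start with all objects in one cluster
    while len(clusters) < n_clusters:
        # choose the cluster with the largest diameter (largest internal distance)
        diameters = [D[np.ix_(c, c)].max() for c in clusters]
        k = int(np.argmax(diameters))
        if diameters[k] == 0:
            break  # only identical or single objects left: nothing to split
        remaining = clusters.pop(k)
        # seed the splinter group with the object most dissimilar to the others
        avg = D[np.ix_(remaining, remaining)].sum(axis=1) / (len(remaining) - 1)
        splinter = [remaining.pop(int(np.argmax(avg)))]
        while len(remaining) > 1:
            to_rest = D[np.ix_(remaining, remaining)].sum(axis=1) / (len(remaining) - 1)
            to_splinter = D[np.ix_(remaining, splinter)].mean(axis=1)
            gain = to_rest - to_splinter
            j = int(np.argmax(gain))
            if gain[j] <= 0:
                break  # no object is closer to the splinter group any more
            splinter.append(remaining.pop(j))
        clusters.extend([remaining, splinter])
    labels = np.empty(len(D), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# toy data (made up for illustration): three well-separated groups
toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [20, 0], [21, 0]]
toy_labels = diana_sketch(toy, n_clusters=3)
print(toy_labels)
```

The label numbers themselves are arbitrary; only the grouping of objects matters.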

## References

An Introduction to Hierarchical Clustering in Python https://www.datacamp.com/tutorial/introduction-hierarchical-clustering-python

Hierarchical clustering. https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html

SciPy Hierarchical Clustering and Dendrogram Tutorial. https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/