# SOM/HAC implementation using SUSI library

This notebook uses the SuSi library to give a worked example of the Self-Organizing Map (SOM) algorithm.

SuSi supports unsupervised clustering as well as supervised and semi-supervised regression and classification.

Links: 
* https://pypi.org/project/susi/
* https://github.com/felixriese/susi

Install the library from the command line:

**pip install susi**

In [None]:
# Core numerical and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from mpl_toolkits.basemap import Basemap

# scikit-learn utilities
import sklearn as sk
from sklearn import datasets
from sklearn import preprocessing

# SOM libraries
import susi
from susi.SOMPlots import plot_nbh_dist_weight_matrix, plot_umatrix
from sompy.sompy import SOMFactory
from sklearn_som.som import SOM


### Load data

The data are read into a pandas DataFrame, and the variables used for clustering are then extracted as a NumPy array.

In [None]:
df = pd.read_csv("data_all_nonan.csv")

# Keep only the columns of interest
df = df[["lat","lon","date","datetime","depth","temp","sal","source","SA","CT","rho","mlp2",
         "spiciness0","alpha","beta","month","tempAnomalies","salAnomalies","year"]]
# Variables the SOM is trained on
clustering_vars = ["lat","lon","mlp2","spiciness0","tempAnomalies","salAnomalies"]
df = df.fillna(0)  # replace any remaining missing values with 0
data = df[clustering_vars].values
N = 500  # use only the first N rows
data = data[:N]
names = clustering_vars

### SOM training

In [None]:
msz = [20, 25]  # SOM grid size: [rows, columns]
som = susi.SOMClustering(
    n_rows=msz[0],
    n_columns=msz[1]
)
# Fit the SOM and map every data point onto the grid
X_transformed = som.fit_transform(data)
print("SOM fitted!")

### Quality Measures

To check the quality of the map, two metrics should be computed:

- **The quantization error**: the average distance between each data vector and its BMU (best matching unit). It measures how accurately the map represents the data.
- **The topographic error**: the proportion of all data vectors for which the first and second BMUs are not adjacent units. It measures how well the map preserves the topology of the data set.

A rule of thumb is to train several models with different parameters and, among those with a topographic error very close to zero, choose the one with the lowest quantization error. Keeping the topographic error very low is important because it makes the component planes smooth and easy to interpret.

In [None]:
qe = som.get_quantization_error(data)
print('The quantization error is ' + str(qe))
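
The cell above reports only the quantization error. As a rough sketch, both metrics can be computed directly with NumPy from an array of node weights. The helper `som_errors` is hypothetical, the weight array is assumed to have shape (n_rows, n_columns, n_features), and "adjacent" is assumed to mean the 8 surrounding grid cells; other definitions of adjacency would change the topographic error.

```python
import numpy as np


def som_errors(weights, X):
    """Return (quantization error, topographic error) for a rectangular SOM.

    weights: array of shape (n_rows, n_columns, n_features)
    X:       array of shape (n_samples, n_features)
    """
    n_rows, n_cols, n_features = weights.shape
    flat = weights.reshape(-1, n_features)

    # Euclidean distance from every sample to every node
    dists = np.linalg.norm(X[:, None, :] - flat[None, :, :], axis=2)

    # Flat indices of the best and second-best matching units per sample
    order = np.argsort(dists, axis=1)
    first, second = order[:, 0], order[:, 1]

    # Quantization error: mean distance between a sample and its BMU
    qe = dists[np.arange(len(X)), first].mean()

    # Topographic error: share of samples whose two best units are not
    # neighbours on the grid (assumption: 8-neighbourhood)
    r1, c1 = np.divmod(first, n_cols)
    r2, c2 = np.divmod(second, n_cols)
    not_adjacent = np.maximum(np.abs(r1 - r2), np.abs(c1 - c2)) > 1
    te = not_adjacent.mean()
    return qe, te


# Toy example with random weights and data (hypothetical values)
rng = np.random.default_rng(0)
toy_weights = rng.random((4, 5, 3))
toy_X = rng.random((50, 3))
qe_toy, te_toy = som_errors(toy_weights, toy_X)
print(f"quantization error: {qe_toy:.3f}, topographic error: {te_toy:.3f}")
```

With a fitted SuSi model, the weights appear to be stored in the attribute `som.unsuper_som_`; if that holds for the installed version, `som_errors(som.unsuper_som_, data)` would give both values for the map trained above.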

## Visualization of SOM 

### Components map

#### No component planes visualization available
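
The planes can still be drawn by hand, since a component plane is just one feature of the node weights shown over the grid. The sketch below assumes a weight array of shape (n_rows, n_columns, n_features); `plot_component_planes` is a hypothetical helper, and the random weights only stand in for a trained map.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_component_planes(weights, names, n_cols=3):
    """Draw one heatmap per input variable from a SOM weight array.

    weights: array of shape (n_rows, n_columns, n_features)
    names:   one label per feature
    """
    n_features = weights.shape[2]
    n_rows_fig = int(np.ceil(n_features / n_cols))
    fig, axes = plt.subplots(n_rows_fig, n_cols,
                             figsize=(4 * n_cols, 3 * n_rows_fig),
                             squeeze=False)
    for k, ax in enumerate(axes.ravel()):
        if k >= n_features:
            ax.axis("off")  # hide unused panels
            continue
        im = ax.imshow(weights[:, :, k], cmap="viridis")
        ax.set_title(names[k])
        fig.colorbar(im, ax=ax)
    fig.tight_layout()
    return fig, axes


# Toy example with random weights (hypothetical values)
rng = np.random.default_rng(0)
toy_weights = rng.random((20, 25, 6))
toy_names = ["lat", "lon", "mlp2", "spiciness0", "tempAnomalies", "salAnomalies"]
fig, axes = plot_component_planes(toy_weights, toy_names)
```

For the map trained above, the call would be along the lines of `plot_component_planes(som.unsuper_som_, names)`, assuming the installed SuSi version exposes the weights under that attribute.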

### Hits map
Visualize the SOM nodes by density: each node shows how many input vectors are mapped onto it.

NOTE: BMUs are given as grid locations (row, column), not as a single node number.

In [None]:
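bmu = np.array(som.get_bmus(data))
# Use one bin per SOM node so the counts line up with the grid
plt.hist2d(bmu[:, 0], bmu[:, 1], bins=msz, range=[[0, msz[0]], [0, msz[1]]])
plt.colorbar(label="hits")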

### U-matrix
The "U-matrix", or "distance matrix", shows the Euclidean distances between the weight vectors of neighboring units and therefore visualizes the cluster structure of the map. High values in the U-matrix mean large distances between neighboring units and thus indicate cluster borders. Clusters typically appear as uniform areas of low values.

In [None]:
umat = som.get_u_matrix()
plt.imshow(np.squeeze(umat))
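
To make the idea concrete, here is a minimal sketch of a simplified U-matrix that stores, for each node, the mean Euclidean distance to its up, down, left and right neighbours. It assumes a weight array of shape (n_rows, n_columns, n_features) with at least two nodes; `simple_u_matrix` is a hypothetical helper, and SuSi's own `get_u_matrix` may use a different layout, so the values need not match.

```python
import numpy as np


def simple_u_matrix(weights):
    """Mean Euclidean distance from each node to its 4-connected neighbours.

    weights: array of shape (n_rows, n_columns, n_features)
    Returns an array of shape (n_rows, n_columns).
    """
    n_rows, n_cols, _ = weights.shape
    total = np.zeros((n_rows, n_cols))
    count = np.zeros((n_rows, n_cols))

    # Distances between vertically adjacent nodes, shape (n_rows - 1, n_cols)
    d_vert = np.linalg.norm(weights[1:] - weights[:-1], axis=2)
    total[1:] += d_vert
    total[:-1] += d_vert
    count[1:] += 1
    count[:-1] += 1

    # Distances between horizontally adjacent nodes, shape (n_rows, n_cols - 1)
    d_horz = np.linalg.norm(weights[:, 1:] - weights[:, :-1], axis=2)
    total[:, 1:] += d_horz
    total[:, :-1] += d_horz
    count[:, 1:] += 1
    count[:, :-1] += 1

    return total / count


# Toy map with two flat regions: the border between them shows up as a ridge
toy = np.zeros((4, 6, 2))
toy[:, 3:, :] = 1.0
u = simple_u_matrix(toy)
print(u.round(2))
```

In the toy map only the two columns on either side of the border get non-zero values, which is the "high values mark cluster borders" behaviour described above.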

### Visualize SOM "clusters" on map

In [None]:
#need to figure this out...
#need to assign numbers to each node, and then match up with the bmus (row,column)
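
One possible way to number the nodes is to flatten each (row, column) pair into a single index with `np.ravel_multi_index`. The sketch below uses a few made-up BMU coordinates and assumes nodes are counted row by row starting from 0.

```python
import numpy as np

msz = (20, 25)  # grid size (rows, columns), as used for training above

# A few example BMU coordinates (hypothetical values)
bmus = np.array([(0, 0), (0, 5), (3, 24), (19, 24), (0, 5)])

# (row, column) -> single node number, counting row by row from 0
node_ids = np.ravel_multi_index((bmus[:, 0], bmus[:, 1]), msz)
print(node_ids.tolist())  # [0, 5, 99, 499, 5]

# Number of data points mapped to each node, reshaped back onto the grid
hits = np.bincount(node_ids, minlength=msz[0] * msz[1]).reshape(msz)

# Node number -> (row, column), to go the other way
rows, cols = np.unravel_index(node_ids, msz)
```

With the trained map, `bmus = np.array(som.get_bmus(data))` would replace the made-up coordinates.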

In [None]:
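nodes = som.get_clusters(data)  # BMU (row, column) of every data point
# For each data point, collect the indices of all data points on the same node
clus_nds = []
for node in nodes:
    nd = som.get_datapoints_from_node(node)
    clus_nds.append(nd)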

In [None]:
ts = ['temp', 'sal']
data_ts = df[ts].values
data_ts = data_ts[:N]

# For each data point, the temp/sal rows of all points that share its node
dat_ts = [data_ts[idx] for idx in clus_nds]

# Colour every point by the flat index of its BMU (row * n_columns + column)
node_ids = np.ravel_multi_index(tuple(np.asarray(nodes).T), tuple(msz))
plt.scatter(data_ts[:, 0], data_ts[:, 1], c=node_ids, cmap='rainbow', s=6)
plt.xlabel('temp')
plt.ylabel('sal')

## Hierarchical Agglomerative Clustering

In [None]:
import logging

from scipy.cluster.hierarchy import linkage, dendrogram, cut_tree
import seaborn as sns

# Silence matplotlib font-manager warnings
logging.getLogger('matplotlib.font_manager').setLevel(logging.ERROR)

# Average-linkage clustering of the BMU coordinates of the data points
clustering = linkage(nodes, method="average", metric="euclidean")
dendrogram(clustering)
plt.show()

# Cut the tree into 6 clusters, one label per data point
cluster_labels = cut_tree(clustering, n_clusters=6).reshape(-1, )
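
The cell above clusters the BMU grid coordinates of the data points. A different, commonly used two-stage variant clusters the SOM node weight vectors instead and then gives every data point the cluster of its BMU. The following is only a sketch of that variant, not the method used in this notebook: the random weights and BMU coordinates are stand-ins, and `fcluster` with `criterion="maxclust"` is one of several ways to cut the tree.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Toy stand-in for trained SOM weights: (n_rows, n_columns, n_features)
n_rows, n_cols, n_features = 20, 25, 6
toy_weights = rng.random((n_rows, n_cols, n_features))

# One row per node, so each node is one observation for the linkage
flat_weights = toy_weights.reshape(-1, n_features)
Z = linkage(flat_weights, method="average", metric="euclidean")

# Cut the tree into at most 6 clusters; labels start at 1
node_labels = fcluster(Z, t=6, criterion="maxclust").reshape(n_rows, n_cols)

# Toy BMU coordinates (row, column) for five data points
toy_bmus = np.array([(0, 0), (0, 5), (3, 24), (19, 24), (0, 5)])

# Each data point inherits the cluster label of its BMU
point_labels = node_labels[toy_bmus[:, 0], toy_bmus[:, 1]]
print(point_labels)
```

Plotting `node_labels` with `plt.imshow` would then show the clusters as regions on the SOM grid.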
    

In [77]:
nodez = np.array(nodes) #turn nodes from list to np array 
un_nodes = np.unique(nodez, axis=0,  return_counts=True) # find unique values (each one being a node) 
num_reps = un_nodes[1] #get the number of times each node was repeated 
# repeat dummy variable the same amount of times as num_reps 
node_num = []
nn = len(un_nodes[0])
for x in range(nn):
    nd = np.tile(x, (num_reps[x], 1))
    node_num.append(nd) 


In [81]:
un_nodes = np.unique(nodez, axis=0,  return_counts=True)
un_nodes

(array([[ 0,  0],
        [ 0,  5],
        [ 0, 10],
        [ 0, 11],
        [ 0, 12],
        [ 0, 13],
        [ 0, 18],
        [ 0, 19],
        [ 0, 21],
        [ 1,  5],
        [ 1, 10],
        [ 2, 11],
        [ 3,  4],
        [ 3,  8],
        [ 3, 14],
        [ 3, 24],
        [ 4,  4],
        [ 4,  7],
        [ 4,  9],
        [ 4, 10],
        [ 4, 11],
        [ 4, 14],
        [ 4, 24],
        [ 5,  5],
        [ 5,  8],
        [ 5, 10],
        [ 5, 11],
        [ 5, 24],
        [ 6,  0],
        [ 6,  3],
        [ 6,  4],
        [ 6,  5],
        [ 6,  6],
        [ 6,  7],
        [ 6, 10],
        [ 6, 24],
        [ 7,  0],
        [ 7,  5],
        [ 7,  6],
        [ 7,  8],
        [ 7, 10],
        [ 7, 12],
        [ 8,  0],
        [ 8,  5],
        [ 8,  6],
        [ 8,  7],
        [ 9,  0],
        [ 9,  5],
        [ 9,  9],
        [10,  0],
        [10,  5],
        [10,  8],
        [10, 16],
        [10, 20],
        [10, 21],
        [1