# Dimensionality Reduction & Clustering
In this notebook you can find all experiments used for dimensionality reduction and clustering, including part of the sanity checks mentioned in the report and the UMAP hyperparameter search.

Please note that this notebook has been designed to be used in a google colab setup, as the UMAP iterations used for the hyperparemter search are computationally expensive. All colab-specific code sections have been highlighted as such.

We recommend to import the entire SanityChecks_HyperparamSearch-Folder into your colab drive to ensure a seamless working of this notebook. In order to be able to use the defined path structure for embedding retrieval and plot storage the folder shall be placed directly in the drive and not within a subfolder.

# 0. Setup
We will begin by importing all required libraries and fixing the plotting setup.

In [None]:
#imports
import pandas as pd
import numpy as np
import pickle
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import matplotlib.colors as mcolors
import seaborn as sns
import random
import os
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import SpectralClustering
from sklearn.cluster import Birch
from sklearn.cluster import OPTICS
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn import mixture
from sklearn.mixture import BayesianGaussianMixture
from scipy.spatial.distance import pdist, squareform
from scipy.spatial import distance
#colab specific: ######################################################################
!pip install umap-learn[plot]
!pip install holoviews
!pip install -U ipykernel
#######################################################################################
import umap

In [2]:
#plot settings
%matplotlib inline
sns.set(style='white', context='notebook', rc={'figure.figsize':(3,3)})
plt.ioff() #ensures that no plots are shown within the notebook if not explicity demanded by plt.show()

#color configuration
colors = list(mcolors.CSS4_COLORS.keys())
colors = random.sample(colors, 100)
colors = np.array(colors)

# 1. Setting the Parameters
For a better legibility of this document, we will store all parameter settings in this section.

In [3]:
#colab specific path setup for plot storage: #######################################################
#Path to colab workspace
w_path = '/content/gdrive/MyDrive/SanityChecks_HyperparamSearch/'
####################################################################################################

#Pre-processing of the imported data - choose between...
# 'feature_stand': feature standardization leading to unit vairance and zero mean of all features across the samples
# 'norm_vecs': Normalized embedding vectors, that project all embedding vectors on a unit sphere
# 'none'
pre_processing = 'none'


#Dimensionality reduction of the imported data - choose between...
# 'PCA'
# 'UMAP'
dim_reduction = 'UMAP'


#Data-generation network (parameter is useful in case several sets of embeddings have been used which have been trained using different network architectures)
data_generation = 'final'

# 2. Loading the embeddings

Now, we will have to load the user embeddings from the npy file they are stored in.

We also want to check that the imported data has the desired dimensions, to make sure that nothing went wrong throughout the process of creating and storing the embeddings in the npy file, and importing them into this document.

In [12]:
#colab specific import setup: ######################################################################
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
####################################################################################################

#colab specific path setup for plot storage: #######################################################
#importing the high-dimensional user embeddings retrieved from the model pre-training
path= w_path + 'user_embeddings_final.npy'
####################################################################################################

#loading the data
data = np.load(path)

#creating folder to sort images into
hyperparam_path = w_path + 'hyperparameter_search'
if os.path.exists(hyperparam_path) == False:
  os.mkdir(hyperparam_path)
data_gen_path = hyperparam_path + '/' + data_generation
if os.path.exists(data_gen_path) == False:
  os.mkdir(data_gen_path)

#Check that embeddings have the correct shape
print(data.shape)

Mounted at /content/gdrive
(2062, 384)


# 3. Pre-processing the embeddings
Before passing our user embeddings on to the dimensionality reduction, we will have to pre-process them, to make sure that not a few features only dominate the dimensionality reduction due to scale differences.

In [13]:
#checking which means of data pre-processing to use
#   norm_vecs     = embedding normalization
#   feature_stand = feature standardization
#   none          = no pre-processing
if pre_processing == 'norm_vecs':
  scale_factors = np.sum(data, axis = 1)
  scale_factors = scale_factors[:, np.newaxis]
  processed_data = data / scale_factors
elif pre_processing =='feature_stand':
  processed_data = StandardScaler().fit_transform(data)
else:
  processed_data = data

#create folder to sort images into
pre_processing_path = data_gen_path + '/' + pre_processing
if os.path.exists(pre_processing_path) == False:
  os.mkdir(pre_processing_path)


# 4. Performing Dimensionality Reduction
In this section we perform hyperparameter search for dimensionality reduction. Dimensionality reduction can be performed using either UMAP or PCA. Please see section 1 in order to configer the transformation used.


In [14]:
#initializing dict to save reduced dimensionality embeddings in, and to later retrieve those embeddings based on the parameters
embeddings = {}


In [None]:
#Create folders to later deposit images in
path_dimred = pre_processing_path
if os.path.exists(path_dimred) == False:
  os.mkdir(path_dimred)

path2D = path_dimred + '/' + 'output_dim_2'
if os.path.exists(path2D) == False:
  os.mkdir(path2D)

path3D = path_dimred + '/' + 'output_dim_3'
if os.path.exists(path3D) == False:
  os.mkdir(path3D)

#Checking which means of dimensionality reduction to use
if dim_reduction == 'UMAP':

  #UMAP
  #iterating over plausible hyperparameter values: output dimensionality, number of neighbors, minimum distance and distance metric
  #note that output dimensionality is not an actual hyperparameter; we only use two dimensions as this helps for visual inspections of the results
  for n_dims in [2,3]:
    for n_neighbors in np.concatenate((range(2,11,1), range(15,51,5))): #n_neighbors in steps of 1 from 2-10 and in steps of 5 from 15 to 50
      for min_dist in range(0,10,1): #min_dist between 0 and 1 in steps of 0.1
        for measure in ['euclidean','manhattan','cosine']: #different distance measures

          reducer = umap.UMAP(n_components = n_dims, n_neighbors = n_neighbors, min_dist = min_dist/10, metric = measure) #initialize umap with desired hyperparams
          reduced_data = reducer.fit_transform(processed_data)#calculate umap and return reduced embeddings
          embeddings[data_generation+ pre_processing + dim_reduction + str(n_dims) + str(n_neighbors)+ str(min_dist/10) + measure] = reduced_data #saving embeddings for retrieval during clustering

          #two-dimensional case
          if n_dims == 2:
            #plotting result and saving plots as images
            plt.figure()
            plt.scatter(reduced_data[:, 0], reduced_data[:, 1], s=0.1)
            plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + str(n_neighbors)+ '     min_dist:' + str(min_dist/10) + '     metric:' + measure, fontsize=8)
            path2D_image = path2D + '/' + str(n_neighbors) + '_' + str(min_dist/10) + '_' + measure +'.pdf'
            plt.savefig(path2D_image, pad_inches = 15)

          #three-dimensional case
          else:
            #plotting result and saving plots as images
            plt.figure()
            ax = plt.axes(projection='3d')
            ax.scatter3D(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2],s=0.1)
            plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + str(n_neighbors)+ '     min_dist:' + str(min_dist/10) + '     metric:' + measure, fontsize=8)
            path3D_image = path3D + '/' + str(n_neighbors) + '_' + str(min_dist/10) + '_' + measure +'.pdf'
            plt.savefig(path3D_image, pad_inches = 15)

    if os.path.exists(data_gen_path +'/' + 'reduced_embeddings') == False:
      os.mkdir(data_gen_path +'/' +  'reduced_embeddings')

    with open(data_gen_path + '/' + 'reduced_embeddings'+ '/' + 'red_embeddings_UMAP_' + pre_processing + '_' + str(n_dims) + '.pickle', 'wb') as file:
      pickle.dump(embeddings, file)


else:
  #PCA
  #iterating over hyperparameter: output dimensionality
  for n_dims in [2,3]:
    pca = PCA(n_components = n_dims)
    reduced_data = pca.fit_transform(processed_data)
    embeddings[data_generation+ pre_processing + dim_reduction + str(n_dims) + 'na'+ 'na' + 'na'] = reduced_data #saving embeddings for retrieval during clustering

    #two-dimensional case
    if n_dims == 2:
      #plotting result and saving plots as images
      plt.figure()
      plt.scatter(reduced_data[:, 0], reduced_data[:, 1], s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction, fontsize=8)
      path2D_image = path2D + '/'+ 'PCA' + '.pdf'
      plt.savefig(path2D_image, pad_inches = 15)

    #three-dimensional case
    else:
      #plotting result and saving plots as images
      plt.figure()
      ax = plt.axes(projection='3d')
      ax.scatter3D(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2],s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction, fontsize=8)
      path3D_image = path3D + '/' + 'PCA' +'.pdf'
      plt.savefig(path3D_image, pad_inches = 15)

    with open(data_gen_path + '/' + 'reduced_embeddings'+ '/' + 'red_embeddings_PCA_' + pre_processing + '.pickle', 'wb') as file:
      pickle.dump(embeddings, file)



# 5. Selecting suitable reduced embeddings
As a next step, we will have to visually inspect the resulting plots to determine the most suitable reduced dimensionality embeddings. We will note down the hyperparamters used to create these embeddings.

The chosen embeddings can be retrieved later when clustering by listing the hyperparameters used to create them as a keyword for the "*embeddings*" dictionary.

In [None]:
#extracting the saved embeddings from our files

umap_feature_stand2 = {}
umap_feature_stand3 = {}
umap_norm_vecs2 = {}
umap_norm_vecs3 = {}
umap_none2 = {}
umap_none3 = {}
pca = {}

with open(w_path + data_generation + '/' + 'reduced_embeddings'+ '/' + 'red_embeddings_UMAP_none_2.pickle', 'rb') as file:
    umap_none2 = pickle.load(file)

with open(w_path + data_generation + '/' + 'reduced_embeddings'+ '/' + 'red_embeddings_UMAP_none_3.pickle', 'rb') as file:
    umap_none3 = pickle.load(file)

with open(w_path + data_generation + '/' + 'reduced_embeddings'+ '/' + 'red_embeddings_UMAP_feature_stand_2.pickle', 'rb') as file:
    umap_feature_stand2 = pickle.load(file)

with open(w_path + data_generation + '/' + 'reduced_embeddings'+ '/' + 'red_embeddings_UMAP_feature_stand_3.pickle', 'rb') as file:
    umap_feature_stand3 = pickle.load(file)

with open(w_path + data_generation + '/' + 'reduced_embeddings'+ '/' + 'red_embeddings_UMAP_norm_vecs_2.pickle', 'rb') as file:
    umap_norm_vex2 = pickle.load(file)

with open(w_path + data_generation + '/' + 'reduced_embeddings'+ '/' + 'red_embeddings_UMAP_norm_vecs_3.pickle', 'rb') as file:
    umap_norm_vex3 = pickle.load(file)

with open(w_path + data_generation + '/' + 'reduced_embeddings'+ '/' + 'red_embeddings_PCA_feature stand.pickle', 'rb') as file:
    pca_feature_stand = pickle.load(file)

with open(w_path + data_generation + '/' + 'reduced_embeddings'+ '/' + 'red_embeddings_PCA_norm_vecs.pickle', 'rb') as file:
    pca_norm_vecs = pickle.load(file)

with open(w_path + data_generation + '/' + 'reduced_embeddings'+ '/' + 'red_embeddings_PCA_none.pickle', 'rb') as file:
    pca_none = pickle.load(file)


#Merging the saved embeddings into one dictionary
embeddings = umap_none2 | umap_none3 | umap_feature_stand2 | umap_feature_stand3 | umap_norm_vex2 | umap_norm_vex3 | pca_feature_stand | pca_norm_vecs | pca_none


In [None]:
#selecting best values from previous dimensionality reduction and placing them in iterable array

########## in case there's just one hyperparameter configuration ################################
data_gen_array = [data_generation, data_generation]
pre_proc_array = ['none', 'none']
dim_red_array = ['UMAP', 'UMAP']
n_dims_array = ['2','3']
neighbors_array = ['2','2']
dist_array = ['0.4', '0.4']
metric_array = ['cosine', 'cosine']
iter_array = np.column_stack((data_gen_array, pre_proc_array, dim_red_array, n_dims_array, neighbors_array, dist_array, metric_array))
##################################################################################################



########## in case of many hyperparameter configurations ########################################
#data_gen_array = np.full((34,), data_generation)

#pre_proc_array_1 = np.full((21,), 'feature_stand')
#pre_proc_array_2 = np.full((13,), 'norm_vecs')
#pre_proc_array = np.concatenate((pre_proc_array_1, pre_proc_array_2))


#dim_red_array = np.full((34,),'UMAP')

#n_dims_array_1 = np.full((10,),'2')
#n_dims_array_2 = np.full((11,), '3')
#n_dims_array_3 = np.full((7,),'2')
#n_dims_array_4 = np.full((6,), '3')
#n_dims_array = np.concatenate((n_dims_array_1, n_dims_array_2, n_dims_array_3, n_dims_array_4))

#neighbors_array = np.full((34,), '2')

#dist_array = ['0.4', '0.3', '0.6', '0.3', '0.7', '0.9', '0.3', '0.7', '0.5', '0.4', '0.4', '0.3', '0.6', '0.5', '0.6', '0.5', '0.6', '0.7','0.7','0.8','0.5','0.4','0.4','0.6','0.8','0.9','0.5','0.7','0.1', '0.2', '0.5', '0.6', '0.4','0.3']

#metric_array = ['cosine','cosine','cosine','euclidean','euclidean','euclidean','euclidean','manhattan','euclidean','euclidean','cosine','cosine','cosine','cosine','euclidean','manhattan','manhattan','euclidean','manhattan','euclidean','euclidean','euclidean','manhattan','euclidean','euclidean','euclidean','manhattan','euclidean','euclidean','manhattan','manhattan','manhattan','manhattan','euclidean']

#iter_array = np.column_stack((data_gen_array, pre_proc_array, dim_red_array, n_dims_array, neighbors_array, dist_array, metric_array))
###################################################################################################

##6 Sanity Check: Proximity Preservation
In this section, we want to see if the proximity of two points in the higher-dimensional embedding is preserved in the lower-dimensional embedding.

In order to do this, we will perform three checks:

**Check 1:** Does the single nearest neighbor of a point in the high-dimensional embedding lie within the same cluster as this point in the low-dimensional embedding?

**Check 2:** Do the 20 nearest neighbors of a point in the high-dimensional embedding lie within the same cluster as this point in the low-dimensional embedding?

**Check 3**: Does a large group of nearest neighbors of a point in the high-dimensinoal embedding lie within the same cluster as this point in the low-dimensional embedding?

**Check 4:** Does a point that is not in close proximity of another point in the high-dimensional embedding lie in a different cluster than this point in the low-dimensional embedding?

In [18]:
#Prepare folders to store resulting plots in

sanity_check_path = w_path + 'UMAP_sanity_checks'
check_1_path = sanity_check_path + '/' + 'check1'
check_2_path = sanity_check_path + '/' + 'check2'
check_3_path = sanity_check_path + '/' + 'check3'
check_4_path = sanity_check_path + '/' + 'check4'

if os.path.exists(sanity_check_path) == False:
  os.mkdir(sanity_check_path)
if os.path.exists(check_1_path) == False:
  os.mkdir(check_1_path)
if os.path.exists(check_2_path) == False:
  os.mkdir(check_2_path)
if os.path.exists(check_3_path) == False:
  os.mkdir(check_3_path)
if os.path.exists(check_4_path) == False:
  os.mkdir(check_4_path)

### 6.2.1 Check 1

For check 1 we first sample a user embedding, then determine its nearest neighbor in the **high-dimensional space** and finally we project all embeddings down to the lower dimensional space to visually inspect the proximity of the two observed embeddings.

We perform this procedure for four embeddings at a time in order to avoid a coincidental observation of a high-quality proximity preservation.

In [None]:
color_iter = ['crimson','olive','cyan','gold']
nr_embeddings = data.shape[0]
idx = np.random.randint(nr_embeddings, size=4)
neighbors = np.empty((4,2))

#iteration over different embedding pre-processings as they may impact the resulting proximity preservation
for pre_processing in ['none', 'feature_stand', 'norm_vecs']:
  if pre_processing == 'norm_vecs':
    scale_factors = np.sum(data, axis = 1)
    scale_factors = scale_factors[:, np.newaxis]
    processed_data = data / scale_factors
  elif pre_processing =='feature_stand':
    processed_data = StandardScaler().fit_transform(data)
  else:
    processed_data = data

  #iteration over different distance metric as they may impact the resulting proximity preservation
  for measure in ['euclidean', 'cosine','manhattan']:

    #sampling four user embeddings
    for i in range(4):
      sample_highdim = data[idx[i]]
      #determining the nearest neighbors of each embedding
      nn = NearestNeighbors(n_neighbors=2, metric = measure)
      nn.fit(data)
      sample_highdim = sample_highdim[np.newaxis,:]
      _ , neighbor = nn.kneighbors(sample_highdim)
      neighbors[i] = neighbor
    neighbors = np.asarray(neighbors, dtype = 'int')

    #iteration over different number of neighbors and minimum distance parameter values as they may impact the resulting proximity preservation
    for n_neighbors in [2, 5,10,30]:
      for min_dist in [0.2, 0.4]:

        #performing UMAP dimensionality reduction
        umap_reducer = umap.UMAP(n_components = 2, n_neighbors = n_neighbors, min_dist = min_dist, metric = measure) #initialize umap with desired hyperparams
        reduced_data = umap_reducer.fit_transform(processed_data) #calculate umap and return reduced embeddings

        #setting up the plot
        plt.figure()
        plt.title(str(n_neighbors) +'   ' + str(min_dist) +'   ' + measure + '\n' + pre_processing)
        plt.scatter(reduced_data[:, 0], reduced_data[:, 1], s=0.1, color = 'silver')

        for i in range(4):
          neighbor_embeddings = reduced_data[neighbors[i],:]
          color = color_iter[i]
          plt.scatter(neighbor_embeddings[:, 0], neighbor_embeddings[:, 1], s=0.3, color = color)

        plt.savefig(check_1_path + '/' + pre_processing + '_' + measure + '_' + str(n_neighbors) +'_' + str(min_dist) + '.pdf', bbox_inches='tight')

### 6.2.2 Check 2

For check 2 we follow the same procedure as in check 1, with the only difference being that we determine the 20 nearest neighbors of each sampled user embeddings.

In [None]:
color_iter = ['crimson','olive','cyan','gold']
nr_embeddings = data.shape[0]
idx = np.random.randint(nr_embeddings, size=4)
neighbors = np.empty((4,20))

#iteration over different embedding pre-processings as they may impact the resulting proximity preservation
for pre_processing in ['none', 'feature_stand', 'norm_vecs']:
  if pre_processing == 'norm_vecs':
    scale_factors = np.sum(data, axis = 1)
    scale_factors = scale_factors[:, np.newaxis]
    processed_data = data / scale_factors
  elif pre_processing =='feature_stand':
    processed_data = StandardScaler().fit_transform(data)
  else:
    processed_data = data

  #iteration over different distance metric as they may impact the resulting proximity preservation
  for measure in ['euclidean', 'cosine','manhattan']:

    #sampling four user embeddings
    for i in range(4):
      sample_highdim = data[idx[i]]
      #determining the nearest neighbors of each embedding
      nn = NearestNeighbors(n_neighbors=20, metric = measure)
      nn.fit(data)
      sample_highdim = sample_highdim[np.newaxis,:]
      _ , neighbor = nn.kneighbors(sample_highdim)
      neighbors[i] = neighbor
    neighbors = np.asarray(neighbors, dtype = 'int')

    #iteration over different number of neighbors and minimum distance parameter values as they may impact the resulting proximity preservation
    for n_neighbors in [2, 5,10,30]:
      for min_dist in [0.2, 0.4]:

        #performing UMAP dimensionality reduction
        umap_reducer = umap.UMAP(n_components = 2, n_neighbors = n_neighbors, min_dist = min_dist, metric = measure) #initialize umap with desired hyperparams
        reduced_data = umap_reducer.fit_transform(processed_data) #calculate umap and return reduced embeddings

        #setting up the plot
        plt.figure()
        plt.title(str(n_neighbors) +'   ' + str(min_dist) +'   ' + measure + '\n' + pre_processing)
        plt.scatter(reduced_data[:, 0], reduced_data[:, 1], s=0.1, color = 'silver')

        for i in range(4):
          neighbor_embeddings = reduced_data[neighbors[i],:]
          color = color_iter[i]
          plt.scatter(neighbor_embeddings[:, 0], neighbor_embeddings[:, 1], s=0.3, color = color)

        plt.savefig(check_2_path + '/' + '20points' + '_' + pre_processing + '_' + measure + '_' + str(n_neighbors) +'_' + str(min_dist) + '.pdf', bbox_inches='tight')

###6.2.3 Check 3

For check 3, we cluster the user embeddings in the high-dimensional space using K-Means, and then visualize the clusters in the lower-dimensional spcae.

In [25]:
#Dimensionality reduction
reducer = umap.UMAP(n_components = 3, n_neighbors = 20, min_dist = 0.4, metric = 'cosine') #initialize umap with desired hyperparams
reduced_data = reducer.fit_transform(data)#calculate umap and return reduced embeddings

#Clustering in the high-dimensioal space
kmeans = KMeans(init="random", n_clusters=8, n_init=10, max_iter=300, random_state=42)
kmeans.fit(data)
labels = kmeans.labels_ #extracting labels of each sample

#configuring plot settings, one color for each created label
plt.figure()
ax = plt.axes(projection='3d')
ax.scatter3D(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2],c = np.take(colors, labels),s=0.1)
path3D_image = check_3_path + '/' + '3d_20nn_0.4md_cosine'+ '.pdf'
plt.savefig(path3D_image, bbox_inches = 'tight')

### 6.2.4 Check 4
For check four, we randomly sample a user embedding and then determine the user embedding farthest away in the high-dimensional space. We then visualize both embeddings in the low-dimensional space.

We perform this procedure for four embeddings at a time in order to avoid a coincidental observation of a high-quality proximity preservation.

In [None]:
color_iter = ['crimson','olive','cyan','gold']

#iteration over different embedding pre-processings as they may impact the resulting proximity preservation
for pre_processing in ['none', 'feature_stand', 'norm_vecs']:
  if pre_processing == 'norm_vecs':
    scale_factors = np.sum(data, axis = 1)
    scale_factors = scale_factors[:, np.newaxis]
    processed_data = data / scale_factors
  elif pre_processing =='feature_stand':
    processed_data = StandardScaler().fit_transform(data)
  else:
    processed_data = data

  #iteration over different distance metric as they may impact the resulting proximity preservation
  for metric in ['euclidean', 'cosine', 'manhattan']:
    n_neighbors = 2
    min_dist = 0.4
    umap_reducer = umap.UMAP(n_components = 2, n_neighbors = n_neighbors, min_dist = min_dist, metric = metric) #initialize umap with desired hyperparams
    reduced_data = umap_reducer.fit_transform(processed_data) #calculate umap and return reduced embeddings

    if metric == 'manhattan':
      metric_c = 'cityblock'
    else:
      metric_c = metric

    distance_vec = pdist(data, metric = metric_c)
    distance_matrix = squareform(distance_vec)

    #sampling four user embeddings
    for testpoints in range(4):

      #determining user embeddings furthest away from sampled embedding in high-dimensional space
      max_distance, [i,j] = np.nanmax(distance_matrix), np.unravel_index( distance_matrix.argmax(), distance_matrix.shape )
      distance_matrix[i,j] = 0
      distance_matrix[j,i] = 0
      color = color_iter[testpoints]

      #plot setup
      plt.figure()
      plt.title('Distant Points' + '\n' + pre_processing + str(n_neighbors) +'   ' + str(min_dist) +'   ' + metric )
      plt.scatter(reduced_data[:, 0], reduced_data[:, 1], s=0.1, color = 'silver')
      plt.scatter(reduced_data[i, 0], reduced_data[i, 1], s=0.3, color = color)
      plt.scatter(reduced_data[j, 0], reduced_data[j, 1], s=0.3, color = color)
      print(np.linalg.norm(reduced_data[i] - reduced_data[j]) )
      print('manhattan' + str(distance.cityblock(reduced_data[i], reduced_data[j])))
      print('cosine'+ str(distance.cosine(reduced_data[i], reduced_data[j])))
      plt.savefig(check_4_path + '/' + pre_processing + '_' + metric + '_' + str(n_neighbors) +'_' + str(min_dist) + '.pdf', bbox_inches='tight')


# 7. Clustering
Finally, we will have to cluster the reduced dimensionality embeddings.

In this section, we will try out different clustering algorithms with different hyperparameters each. We will furthermore use the CH-metric to support qualitative assesments of the clustering, enabling us to choose our final set of hyperparameters.

Please note that this metric calculates the quality score of the clustering based on distance, i.e. factors such as cluster diameter, average distance between cluster points, distance between separate clusters etc..
Yet, as UMAP focuses on preserving the local structure of the data, the distances between clusters as well as the size of the clusters themselves are not interpretable. Moreover, some of the used clustering algorithms do not cluster according to distance but according to density, distribution, or graph structures. Therefore, the used metric does not perfectly evaluate the quality of the clustering and merely serves as an approximative assistance for the evalutation.

## 7.1 K-Means

In [None]:
#creating dict to track ch scores
ch_tracker = {}
print(iter_array)

#iterating over the different hyperparameter values and reduced dimensionality embeddings
#we retrieve the embeddings from the embeddings dictionary created in section 4
for data_generation, pre_processing, dim_reduction, n_dims, n_neighbors, min_dist, measure in iter_array: #iteration over different reduced dim embeddings
  reduced_data = embeddings[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure]

  #iterating over number of clusters
  for n_clusters in np.concatenate((range(5,10,1),range(10,41,5))):

    #performing KMeans clustering
    kmeans = KMeans(init="random", n_clusters=n_clusters, n_init=10, max_iter=300, random_state=42)
    kmeans.fit(reduced_data)
    labels = kmeans.labels_ #extracting labels of each sample
    ch_score = round(metrics.calinski_harabasz_score(reduced_data, labels),2)
    ch_tracker[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure + str(n_clusters)] = ch_score

    #2-dimensional case
    if n_dims == '2':
      #configuring plot settings, one color for each created label
      plt.figure()
      plt.scatter(reduced_data[:,0], reduced_data[:,1], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + '\n' + 'ch score:'+str(ch_score) + '    nr clusters:' + str(n_clusters) + '   Kmeans', fontsize=8)
      clustering_path = w_path  + data_generation + '/' + 'k_means' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path  + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(n_clusters) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

    #3-dimensional case
    else:
      plt.figure()
      ax = plt.axes(projection='3d')
      ax.scatter3D(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + '\n' + 'ch score:'+str(ch_score)+ '    nr clusters:' + str(n_clusters) + '   Kmeans', fontsize=8)
      clustering_path = w_path  + data_generation + '/' + 'k_means' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(n_clusters) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

In [None]:
#Determining the best clusters according to ch score and printing them in table
largest_keys = sorted(ch_tracker, key=ch_tracker.get, reverse=True)[:5]
largest_vals = [ch_tracker[x] for x in largest_keys]
length = len(largest_vals)
heading = np.empty(length, dtype = str)
heading[:] = 'value'
pd.DataFrame(largest_vals, index = largest_keys, columns=["values"])

Unnamed: 0,values
finalnoneUMAP220.4cosine40,3888.6
finalnoneUMAP220.4cosine35,3580.45
finalnoneUMAP220.4cosine30,3065.67
finalnoneUMAP220.4cosine25,3025.41
finalnoneUMAP220.4cosine20,3005.62


## 7.2 Agglomerative Clustering


In [None]:
#creating dict to track ch scores
ch_tracker = {}

#iterating over the different hyperparameter values and reduced dimensionality embeddings
#we retrieve the embeddings from the embeddings dictionary created in section 4
for data_generation, pre_processing, dim_reduction, n_dims, n_neighbors, min_dist, measure in iter_array: #iteration over different reduced dim embeddings
  reduced_data = embeddings[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure]

  #iterating over number of clusters
  for n_clusters in np.concatenate((range(5,10,1),range(10,41,5))):
    agglo = AgglomerativeClustering(n_clusters = n_clusters)
    agglo.fit(reduced_data)
    labels =  agglo.labels_
    ch_score = round(metrics.calinski_harabasz_score(reduced_data, labels),2)
    ch_tracker[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure + str(n_clusters)] = ch_score

    #2-dimensional case
    if n_dims == '2':

      #configuring plot settings, one color for each created label
      plt.figure()
      plt.scatter(reduced_data[:,0], reduced_data[:,1], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + '\n' + 'ch score:'+str(ch_score) + '    nr clusters:' + str(n_clusters) + '\n Agglo', fontsize=8)
      clustering_path = w_path + data_generation + '/' + 'agglo' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(n_clusters) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

    #3-dimensional case
    else:
      #configuring plot settings, one color for each created label
      plt.figure()
      ax = plt.axes(projection='3d')
      ax.scatter3D(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + '\n' + 'ch score:'+str(ch_score)+ '    nr clusters:' + str(n_clusters)+ '\n Agglo', fontsize=8)
      clustering_path = w_path + data_generation + '/' + 'agglo' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(n_clusters) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

In [None]:
#Determining the best clusters according to ch score and printing them in table
largest_keys = sorted(ch_tracker, key=ch_tracker.get, reverse=True)[:5]
largest_vals = [ch_tracker[x] for x in largest_keys]
length = len(largest_vals)
heading = np.empty(length, dtype = str)
heading[:] = 'value'
pd.DataFrame(largest_vals, index = largest_keys, columns=["values"])

Unnamed: 0,values
finalnoneUMAP220.4cosine40,4851.26
finalnoneUMAP220.4cosine35,4248.93
finalnoneUMAP220.4cosine30,3796.42
finalnoneUMAP220.4cosine25,3475.18
finalnoneUMAP320.4cosine40,3363.84


## 7.3 Spectral Clustering

In [None]:
#creating dict to track ch scores
ch_tracker = {}

#iterating over the different hyperparameter values and reduced dimensionality embeddings
#we retrieve the embeddings from the embeddings dictionary created in section 4
for data_generation, pre_processing, dim_reduction, n_dims, n_neighbors, min_dist, measure in iter_array: #iteration over different reduced dim embeddings
  reduced_data = embeddings[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure]

  #iterating over different number of clusters
  for n_clusters in np.concatenate((range(5,10,1),range(10,41,5))):
    spectral = SpectralClustering(n_clusters = n_clusters)
    spectral.fit(reduced_data)
    labels =  spectral.labels_
    ch_score = round(metrics.calinski_harabasz_score(reduced_data, labels),2)
    ch_tracker[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure + str(n_clusters)] = ch_score

    #2-dimensional case
    if n_dims == '2':

      #configuring plot settings, one color for each created label
      plt.figure()
      plt.scatter(reduced_data[:,0], reduced_data[:,1], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + '\n' + 'ch score:'+str(ch_score) + '    nr clusters:' + str(n_clusters)+ '\n Spectral', fontsize=8)
      clustering_path = w_path  + data_generation + '/' + 'spectral' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(n_clusters) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

    #3-dimensional case
    else:

      #configuring plot settings, one color for each created label
      plt.figure()
      ax = plt.axes(projection='3d')
      ax.scatter3D(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + '\n' + 'ch score:'+str(ch_score)+ '    nr clusters:' + str(n_clusters)+ '\n Spectral', fontsize=8)
      clustering_path = w_path  + data_generation + '/' + 'spectral' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(n_clusters) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

In [None]:
#Determining the best clusters according to ch score and printing them in table
largest_keys = sorted(ch_tracker, key=ch_tracker.get, reverse=True)[:5]
largest_vals = [ch_tracker[x] for x in largest_keys]
length = len(largest_vals)
heading = np.empty(length, dtype = str)
heading[:] = 'value'
pd.DataFrame(largest_vals, index = largest_keys, columns=["values"])

Unnamed: 0,values
finalnoneUMAP220.4cosine40,4114.15
finalnoneUMAP220.4cosine35,3588.9
finalnoneUMAP320.4cosine40,3120.19
finalnoneUMAP220.4cosine30,3034.22
finalnoneUMAP320.4cosine35,2853.73


## 7.4 BIRCH Clustering

In [None]:
#creating dict to track ch scores
ch_tracker = {}

#iterating over the different hyperparameter values and reduced dimensionality embeddings
#we retrieve the embeddings from the embeddings dictionary created in section 4
for data_generation, pre_processing, dim_reduction, n_dims, n_neighbors, min_dist, measure in iter_array: #iteration over different reduced dim embeddings
  reduced_data = embeddings[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure]

  #iterating over number of clusters
  for n_clusters in np.concatenate((range(5,10,1),range(10,41,5))):
    birch = Birch(n_clusters = n_clusters)
    birch.fit(reduced_data)
    labels =  birch.labels_
    ch_score = round(metrics.calinski_harabasz_score(reduced_data, labels))
    ch_tracker[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure + str(n_clusters)] = ch_score

    #2-dimensional case
    if n_dims == '2':

      #configuring plot settings, one color for each created label
      plt.figure()
      plt.scatter(reduced_data[:,0], reduced_data[:,1], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + '\n' + 'ch score:'+str(ch_score) + '    nr clusters:' + str(n_clusters) + '\n BIRCH', fontsize=8)
      clustering_path = w_path + data_generation + '/' + 'BIRCH' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(n_clusters) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

    #3-dimensional case
    else:

      #configuring plot settings, one color for each created label
      plt.figure()
      ax = plt.axes(projection='3d')
      ax.scatter3D(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + '\n' + 'ch score:'+str(ch_score)+ '    nr clusters:' + str(n_clusters) + '\n Birch', fontsize=8)
      clustering_path = w_path  + data_generation + '/' + 'BIRCH' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(n_clusters) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

In [None]:
#Determining the best clusters according to ch score and printing them in table
largest_keys = sorted(ch_tracker, key=ch_tracker.get, reverse=True)[:5]
largest_vals = [ch_tracker[x] for x in largest_keys]
length = len(largest_vals)
heading = np.empty(length, dtype = str)
heading[:] = 'value'
pd.DataFrame(largest_vals, index = largest_keys, columns=["values"])

## 7.5 Optics Clustering

In [None]:
#creating dict to track ch scores
ch_tracker = {}

#iterating over the different hyperparameter values and reduced dimensionality embeddings
#we retrieve the embeddings from the embeddings dictionary created in section 4
for data_generation, pre_processing, dim_reduction, n_dims, n_neighbors, min_dist, measure in iter_array: #iteration over different reduced dim embeddings
  reduced_data = embeddings[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure]

  #iterating over number of samples in a neighborhood for a point to be considered as a core point
  for min_samples in range(20,100,5):
    optics = OPTICS(min_samples = min_samples)
    optics.fit(reduced_data)
    labels = optics.labels_
    ch_score = round(metrics.calinski_harabasz_score(reduced_data, labels))
    ch_tracker[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure + str(n_clusters)] = ch_score

    #2-dimensional case
    if n_dims == '2':

      #configuring plot settings, one color for each created label
      plt.figure()
      plt.scatter(reduced_data[:,0], reduced_data[:,1], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '   -  ' + pre_processing + '  -   ' + dim_reduction + ' \n ' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + ' \n ch score:'+ str(ch_score) + '    min samples:' + str(min_samples) +  ' \n Optics', fontsize=8)
      clustering_path = w_path + data_generation + '/' + 'Optics' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(min_samples) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

    #3-dimensional case
    else:

      #configuring plot settings, one color for each created label
      plt.figure()
      ax = plt.axes(projection='3d')
      ax.scatter3D(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + ' \n ch score:' + str(ch_score)+ '    min samples:' + str(min_samples) + '\n Optics', fontsize=8)
      clustering_path = w_path + data_generation + '/' + 'Optics' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(min_samples) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

In [None]:
#Determining the best clusters according to ch score and printing them in table
largest_keys = sorted(ch_tracker, key=ch_tracker.get, reverse=True)[:5]
largest_vals = [ch_tracker[x] for x in largest_keys]
length = len(largest_vals)
heading = np.empty(length, dtype = str)
heading[:] = 'value'
pd.DataFrame(largest_vals, index = largest_keys, columns=["values"])

Unnamed: 0,values
finalnoneUMAP320.4cosine40,675
finalnoneUMAP220.4cosine40,314


## 7.6 DBScan Clustering

In [None]:
#creating dict to track ch scores
ch_tracker = {}

#iterating over the different hyperparameter values and reduced dimensionality embeddings
for data_generation, pre_processing, dim_reduction, n_dims, n_neighbors, min_dist, measure in iter_array: #iteration over different reduced dim embeddings
  reduced_data = embeddings[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure]

  #iterating over number of samples in a neighborhood for a point to be considered as a core point
  for min_samples in range(1,200,5):
    for eps in range(1, 100, 1):
      dbscan = DBSCAN(eps= eps/2, min_samples = min_samples)
      dbscan.fit(reduced_data)
      labels = dbscan.labels_
    if len(np.unique(labels)) > 1:
      ch_score = round(metrics.calinski_harabasz_score(reduced_data, labels),2)
    else:
      ch_score = 0
    ch_tracker[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure + str(min_samples) + str(eps)] = ch_score

    #2-dimensional case
    if n_dims == '2':

      #configuring plot settings, one color for each created label
      plt.figure()
      plt.scatter(reduced_data[:,0], reduced_data[:,1], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + '\n' + 'ch score:'+str(ch_score) + '    min samples:' + str(min_samples) + '    eps:'+ str(eps) + '\n DBScan', fontsize=8)
      clustering_path = w_path + data_generation + '/' + 'DBScan' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(min_samples) + '_' + str(eps) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

    #3-dimensional case
    else:

      #configuring plot settings, one color for each created label
      plt.figure()
      ax = plt.axes(projection='3d')
      ax.scatter3D(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + '\n' + 'ch score:'+str(ch_score)+ '    min samples:' + str(min_samples)+ '    eps:'+ str(eps) +'\n DBScan', fontsize=8)
      clustering_path = w_path + data_generation + '/' + 'DBScan' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(min_samples) + '_' + str(eps) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

In [None]:
#Determining the best clusters according to ch score and printing them in table
largest_keys = sorted(ch_tracker, key=ch_tracker.get, reverse=True)[:5]
largest_vals = [ch_tracker[x] for x in largest_keys]
length = len(largest_vals)
heading = np.empty(length, dtype = str)
heading[:] = 'value'
pd.DataFrame(largest_vals, index = largest_keys, columns=["values"])

Unnamed: 0,values
finalnoneUMAP220.4cosine199,0
finalnoneUMAP220.4cosine699,0
finalnoneUMAP220.4cosine1199,0
finalnoneUMAP220.4cosine1699,0
finalnoneUMAP220.4cosine2199,0


## 7.6 Gaussian Mixture Models

In [None]:
#creating dict to track ch scores
ch_tracker = {}

#iterating over the different hyperparameter values and reduced dimensionality embeddings
#we retrieve the embeddings from the embeddings dictionary created in section 4
for data_generation, pre_processing, dim_reduction, n_dims, n_neighbors, min_dist, measure in iter_array: #iteration over different reduced dim embeddings
  reduced_data = embeddings[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure]

  #iterating over number of clusters
  for nr_components in np.concatenate((range(5,10,1),range(10,41,5))):
    gmm = mixture.GaussianMixture(n_components=nr_components)
    labels = gmm.fit_predict(reduced_data)
    ch_score = round(metrics.calinski_harabasz_score(reduced_data, labels),2)
    ch_tracker[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure + str(nr_components)] = ch_score

    #2-dimensional case
    if n_dims == '2':

      #configuring plot settings, one color for each created label
      plt.figure()
      plt.scatter(reduced_data[:,0], reduced_data[:,1], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + '\n' + 'ch score:'+str(ch_score) + '    nr components:' + str(nr_components) + '\n GMM', fontsize=8)
      clustering_path = w_path + data_generation + '/' + 'GMM' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(nr_components) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

    #3-dimensional case
    else:

      #configuring plot settings, one color for each created label
      plt.figure()
      ax = plt.axes(projection='3d')
      ax.scatter3D(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + '\n' + 'ch score:'+str(ch_score)+ '    nr components:' + str(nr_components)+ '\n DBScan', fontsize=8)
      clustering_path = w_path + data_generation + '/' + 'GMM' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path  + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(nr_components) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

In [None]:
#Determining the best clusters according to ch score and printing them in table
largest_keys = sorted(ch_tracker, key=ch_tracker.get, reverse=True)[:5]
largest_vals = [ch_tracker[x] for x in largest_keys]
length = len(largest_vals)
heading = np.empty(length, dtype = str)
heading[:] = 'value'
pd.DataFrame(largest_vals, index = largest_keys, columns=["values"])

Unnamed: 0,values
finalnoneUMAP220.4cosine40,4082.12
finalnoneUMAP220.4cosine35,3541.39
finalnoneUMAP220.4cosine30,3396.23
finalnoneUMAP220.4cosine25,3142.05
finalnoneUMAP320.4cosine40,3119.29


##7.7 Bayesian Gaussian Mixture Model

In [None]:
#creating dict to track ch scores
ch_tracker = {}

#iterating over the different hyperparameter values and reduced dimensionality embeddings
#we retrieve the embeddings from the embeddings dictionary created in section 4
for data_generation, pre_processing, dim_reduction, n_dims, n_neighbors, min_dist, measure in iter_array: #iteration over different reduced dim embeddings
  reduced_data = embeddings[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure]

  #iterating over number of clusters
  for nr_components in np.concatenate((range(5,10,1),range(10,41,5))):
    vgmm = BayesianGaussianMixture(n_components=nr_components, random_state=42)
    labels = vgmm.fit_predict(reduced_data)
    ch_score = round(metrics.calinski_harabasz_score(reduced_data, labels),2)
    ch_tracker[data_generation + pre_processing + dim_reduction + n_dims + n_neighbors+ min_dist + measure + str(nr_components)] = ch_score

    #2-dimensional case
    if n_dims == '2':

      #configuring plot settings, one color for each created label
      plt.figure()
      plt.scatter(reduced_data[:,0], reduced_data[:,1], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + '\n' + 'ch score:'+str(ch_score) + '    nr components:' + str(nr_components) + '\n B_GMM', fontsize=8)
      clustering_path = w_path + data_generation + '/' + 'B_GMM' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(nr_components) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

    #3-dimensional case
    else:

      #configuring plot settings, one color for each created label
      plt.figure()
      ax = plt.axes(projection='3d')
      ax.scatter3D(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2], c = np.take(colors, labels), s=0.1)
      plt.title(data_generation + '     ' + pre_processing + '     ' + dim_reduction + '\n' + 'n_neighbors:' + n_neighbors+ '     min_dist:' + min_dist + '     metric:' + measure + '\n' + 'ch score:'+str(ch_score)+ '    nr components:' + str(nr_components)+ '\n B_GMM', fontsize=8)
      clustering_path = w_path + data_generation + '/' + 'B_GMM' + '/' + pre_processing
      if os.path.exists(clustering_path) == False:
        os.mkdir(clustering_path)
      clustering_path = clustering_path + '/' + n_dims + '_' + n_neighbors + '_' + min_dist + '_' + measure + '_'+ str(nr_components) +'.pdf'
      plt.savefig(clustering_path , bbox_inches='tight')

In [None]:
#Determining the best clusters according to ch score and printing them in table
largest_keys = sorted(ch_tracker, key=ch_tracker.get, reverse=True)[:5]
largest_vals = [ch_tracker[x] for x in largest_keys]
length = len(largest_vals)
heading = np.empty(length, dtype = str)
heading[:] = 'value'
pd.DataFrame(largest_vals, index = largest_keys, columns=["values"])

Unnamed: 0,values
finalnoneUMAP320.4cosine30,2555.06
finalnoneUMAP220.4cosine15,2510.1
finalnoneUMAP320.4cosine40,2479.25
finalnoneUMAP320.4cosine35,2416.49
finalnoneUMAP220.4cosine25,2405.24
