
# **Spectral Clustering: An Empirical Study**

# Implementation for Experiment 1


## Thesis submitted to obtain the academic degree "Bachelor of Science"

Degree program: Wirtschaftsinformatik (Business Information Systems)

**at the Faculty of Business and Economics (Wirtschaftswissenschaftliche Fakultät) of Universität Augsburg**

Chair of Statistics

Submitted to: Prof. Dr. Yarema Okhrin

Supervisor:      Christine Distler (M. Sc.)

Submitted by:

Address:         
>
>
>



Augsburg, March 2023




# Creating the synthetic datasets

In [None]:
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)
N_SAMPLES = 300
NOISE_BLOOBS = [0.07, 0.09, 0.11]
NOISE_CIRCLES = [0.07, 0.09, 0.11]
NOISE_MOONS = [0.05, 0.09, 0.13]

In [None]:
generated_datasets = []
ground_truth = []
X, y = datasets.make_blobs(n_samples=N_SAMPLES, centers=2, 
                             cluster_std=NOISE_BLOOBS[0], 
                             random_state=0, center_box=(-1.5, 1.5))
generated_datasets.append(X)
ground_truth.append(y)
X, y = datasets.make_circles(n_samples=N_SAMPLES, noise=NOISE_CIRCLES[0], 
                             factor = 0.4, random_state=1)
generated_datasets.append(X)
ground_truth.append(y)
X, y = datasets.make_moons(n_samples=N_SAMPLES, noise=NOISE_MOONS[0],
                             random_state=1)
generated_datasets.append(X)
ground_truth.append(y)
generated_datasets = np.array(generated_datasets)
ground_truth = np.array(ground_truth)

# Initializing the grid search


In [None]:
import numpy as np
from scipy.spatial.distance import pdist, squareform
import pandas as pd
from IPython.display import display

## Computing similarities
To choose the variable $\epsilon$, I examine the distances in the datasets. It must be ensured that the $\epsilon$-neighborhood graph is connected. The distances can differ between distance measures, so I examine them for the metrics 'euclidean' and 'correlation'.
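
One way to check the connectivity requirement is sketched below. The helper names `is_eps_graph_connected` and `min_connecting_eps` are hypothetical and not part of this notebook. The sketch assumes a dense distance matrix without duplicate points, because SciPy's graph routines treat zero entries of a dense matrix as missing edges. It uses the fact that the smallest $\epsilon$ that connects the graph equals the longest edge of a minimum spanning tree.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def is_eps_graph_connected(dist, eps):
    """Return True if the eps-neighborhood graph built from `dist` is connected."""
    adjacency = (dist <= eps).astype(int)
    n_components, _ = connected_components(adjacency, directed=False)
    return n_components == 1


def min_connecting_eps(dist):
    """Smallest eps that connects the graph: longest edge of a minimum spanning tree."""
    return minimum_spanning_tree(dist).toarray().max()


# Toy data: three points close together and one far away
points = np.array([[0.0], [1.0], [2.0], [10.0]])
dist = squareform(pdist(points, metric='euclidean'))

eps_min = min_connecting_eps(dist)
print(eps_min)
print(is_eps_graph_connected(dist, eps_min))
print(is_eps_graph_connected(dist, 0.9 * eps_min))
```

Any candidate value for $\epsilon$ below `eps_min` leaves at least one point or group of points disconnected from the rest.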

### Euclidean distance

In [None]:
distance_1 = squareform(pdist(generated_datasets[0], metric='euclidean'))
distance_2 = squareform(pdist(generated_datasets[1], metric='euclidean'))
distance_3 = squareform(pdist(generated_datasets[2], metric='euclidean'))
flatten_array = np.array([distance_1, distance_2, distance_3]).flatten()

plt.figure(figsize=(18, 10))
plt.hist(flatten_array, bins=500)
plt.show()
plt.boxplot(flatten_array)
plt.show()
df_dist_eucl = pd.DataFrame(flatten_array)
display(df_dist_eucl.describe())
print(distance_1.shape)

### Correlation distance

In [None]:
distance_1 = squareform(pdist(generated_datasets[0], metric='correlation'))
distance_2 = squareform(pdist(generated_datasets[1], metric='correlation'))
distance_3 = squareform(pdist(generated_datasets[2], metric='correlation'))
flatten_array_corr = np.array([distance_1, distance_2, distance_3]).flatten()

plt.figure(figsize=(9, 5))
plt.hist(flatten_array_corr)
plt.show()
df_dist_corr = pd.DataFrame(flatten_array_corr)
display(df_dist_corr.describe())
display(np.unique(flatten_array_corr, return_counts=True))
display(distance_1)
print(distance_1.shape)

* For the metric 'euclidean', I choose a range between the lower and upper quartile.
For the metric 'correlation', there are far fewer distinct values than for 'euclidean'. The 'correlation' values must be taken into account in the experiment as well.
* As $\epsilon$, I choose values between the lower and upper quartile. I hope to recognize a pattern in the cluster quality within this range.
* The paper gives a starting point for choosing $\sigma$, but it is only a rule of thumb. I therefore choose as wide a range as possible.
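
The small number of distinct 'correlation' distances has a simple explanation when the points have only two features, as the generated datasets do: after centering, every point is a multiple of $(1, -1)$, so the correlation between two points is $\pm 1$ and the distance is either 0 or 2, up to floating-point error. A minimal sketch with random stand-in data rather than the notebook's datasets:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 2))  # stand-in data with two features

corr_dist = pdist(points, metric='correlation')
euclid_dist = pdist(points, metric='euclidean')

# Every correlation distance is numerically 0 or 2
is_zero_or_two = np.isclose(corr_dist, 0.0) | np.isclose(corr_dist, 2.0)
print(is_zero_or_two.all())

# Number of distinct values after rounding away floating-point noise
n_unique_corr = len(np.unique(np.round(corr_dist, 8)))
n_unique_euclid = len(np.unique(euclid_dist))
print(n_unique_corr, n_unique_euclid)
```

With this metric, an $\epsilon$-neighborhood graph on two-feature data can therefore only distinguish very few regimes of $\epsilon$.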

In [None]:
from sklearn.model_selection import ParameterGrid
params = {
    'K': [2],
    # more metric arguments from scipy.spatial.distance.pdist
    'metric': ['euclidean', 'correlation'],
    'sim_graph': ['fully_connect', 'eps_neighbor', 'knn', 'mutual_knn'],
    'normalized': [1, 2, 3],
    'sigma': np.linspace(0.01, 1.5, 30),
    'knn': np.arange(2, 10),
    # Select column 0 so that the quartiles are scalars and the epsi grid is one-dimensional
    'epsi': np.linspace(df_dist_eucl[0].quantile(0.25), df_dist_eucl[0].quantile(0.75), 30),
}

grid = list(ParameterGrid(params))

In [None]:
len(grid)

With this many combinations, the computation takes very long and aborts. In addition, combinations are created that are not needed at all. For example, if $sim\_graph = 'fully\_connect'$, combinations that vary $epsi$ are useless. I therefore optimize the grid search.
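
The size of a full grid is the product of the lengths of all parameter lists, here $1 \cdot 2 \cdot 4 \cdot 3 \cdot 30 \cdot 8 \cdot 30 = 172{,}800$ combinations per dataset. `ParameterGrid` also accepts a list of dictionaries and then enumerates each sub-grid separately, which avoids the useless combinations. A minimal sketch with shortened, illustrative value lists (not the ranges used in the experiment):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Shortened, illustrative value lists
sigmas = np.linspace(0.01, 1.5, 3)
ks = np.arange(3, 6)

# Full cross product: every sigma is combined with every k, although
# 'fully_connect' ignores knn and 'knn' ignores sigma
full = ParameterGrid({
    'sim_graph': ['fully_connect', 'knn'],
    'normalized': [1, 2, 3],
    'sigma': sigmas,
    'knn': ks,
})

# One sub-grid per graph type, each with only the parameter it uses
split = ParameterGrid([
    {'sim_graph': ['fully_connect'], 'normalized': [1, 2, 3], 'sigma': sigmas},
    {'sim_graph': ['knn'], 'normalized': [1, 2, 3], 'knn': ks},
])

print(len(full), len(split))
```

Concatenating separately built lists of parameter combinations, one per graph type, yields the same set of combinations as the list-of-dictionaries form.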

In [None]:
from sklearn.model_selection import ParameterGrid
params1 = {
    'K': [2],
    # more metric arguments from scipy.spatial.distance.pdist
    'metric': ['euclidean', 'correlation'],
    'sim_graph': ['fully_connect'],
    'normalized': [1, 2, 3],
    'sigma': np.linspace(0.01, 1.5, 30),
}

grid1 = list(ParameterGrid(params1))
print(len(grid1))

params2_1 = {
    'K': [2],
    # more metric arguments from scipy.spatial.distance.pdist
    'metric': ['euclidean'],
    'sim_graph': ['eps_neighbor'],
    'normalized': [1, 2, 3],
    # Select column 0 so that the quartiles are scalars and the epsi grid is one-dimensional
    'epsi': np.linspace(df_dist_eucl[0].quantile(0.25), df_dist_eucl[0].quantile(0.75), 30),
}

grid2_1 = list(ParameterGrid(params2_1))
print(len(grid2_1))

params2_2 = {
    'K': [2],
    # more metric arguments from scipy.spatial.distance.pdist
    'metric': ['correlation'],
    'sim_graph': ['eps_neighbor'],
    'normalized': [1, 2, 3],
    'epsi': np.unique(flatten_array_corr),
}

grid2_2 = list(ParameterGrid(params2_2))
print(len(grid2_2))

params3 = {
    'K': [2],
    # more metric arguments from scipy.spatial.distance.pdist
    'metric': ['euclidean', 'correlation'],
    'sim_graph': ['knn'],
    'normalized': [1, 2, 3],
    'knn': np.arange(3, 20),
}

grid3 = list(ParameterGrid(params3))
print(len(grid3))

params4 = {
    'K': [2],
    # more metric arguments from scipy.spatial.distance.pdist
    'metric': ['euclidean', 'correlation'],
    'sim_graph': ['mutual_knn'],
    'normalized': [1, 2, 3],
    'knn': np.arange(3, 20),
}

grid4 = list(ParameterGrid(params4))
print(len(grid4))

grid = grid1+grid2_1+grid2_2+grid3+grid4
print(len(grid))

In [None]:
grid[180]


# Spectral clustering implementation by Dr. Yikun Zhang 

Source: [https://github.com/zhangyk8/Spectral-Clustering/blob/master/spectral_clustering.py](https://github.com/zhangyk8/Spectral-Clustering/blob/master/spectral_clustering.py)

(Accessed: 15.12.2023)

I changed the computation for the fully connected graph to: 
```
W = np.exp(-np.square(Adj_mat) / (2 * sigma)**2)
```
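
Written out, this weight is $w_{ij} = \exp\left(-d_{ij}^2 / (2\sigma)^2\right) = \exp\left(-d_{ij}^2 / (4\sigma^2)\right)$. Compared with the common Gaussian kernel $\exp\left(-d_{ij}^2 / (2\tilde\sigma^2)\right)$, it corresponds to the bandwidth $\tilde\sigma = \sqrt{2}\,\sigma$. The following sketch checks this numerically; the function names are hypothetical and only serve the illustration.

```python
import numpy as np


def modified_kernel(dist, sigma):
    # Weight as used in this notebook: exp(-d^2 / (2 * sigma)^2)
    return np.exp(-np.square(dist) / (2 * sigma) ** 2)


def gaussian_kernel(dist, sigma):
    # Common parametrization: exp(-d^2 / (2 * sigma^2))
    return np.exp(-np.square(dist) / (2 * sigma ** 2))


dist = np.linspace(0.0, 3.0, 7)
sigma = 0.5
kernels_agree = np.allclose(modified_kernel(dist, sigma),
                            gaussian_kernel(dist, np.sqrt(2) * sigma))
print(kernels_agree)
```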




In [None]:
from sklearn.cluster import KMeans

# Based on "A Tutorial on Spectral Clustering" written by Ulrike von Luxburg
def Spectral_Clustering(X, K=8, adj=True, metric='euclidean', sim_graph='fully_connect', sigma=1.0, knn=10, epsi=0.5,
                        normalized=1):
    """
    Input:
        X : [n_samples, n_samples] numpy array if adj=True, or, a [n_samples_a, n_features] array otherwise;

        K: int, The number of clusters;

        adj: boolean, Indicating whether the adjacency matrix is pre-computed. Default: True;

        metric: string, A parameter passing to "scipy.spatial.distance.pdist()" function for computing the adjacency
        matrix (deprecated if adj=True). Default: 'euclidean';

        sim_graph: string, Specifying the type of similarity graphs. Choices are ['fully_connect', 'eps_neighbor',
        'knn', 'mutual_knn']. Default: 'fully_connect';

        sigma: float, The variance for the Gaussian (aka RBF) kernel (Used when sim_graph='fully_connect'). Default: 1;

        knn: int, The number of neighbors used to construct k-Nearest Neighbor graphs (Used when sim_graph='knn'
        or 'mutual_knn'). Default: 10;

        epsi: float, A parameter controlling the connections between points (Used when sim_graph='eps_neighbor').
        Default: 0.5;

        normalized: int, 1: Random Walk normalized version; 2: Graph cut normalized version; other integer values:
        Unnormalized version. Default: 1.

    Output:
        sklearn.cluster class, Attributes:
            cluster_centers_ : array, [n_clusters, n_features], Coordinates of cluster centers in K-means;
            labels_ : Labels of each point;
            inertia_ : float, Sum of squared distances of samples to their closest cluster center in K-means;
            n_iter_ : int, Number of iterations run in K-means.
    """
    # Compute the adjacency matrix
    if not adj:
        Adj_mat = squareform(pdist(X, metric=metric))
    else:
        Adj_mat = X
    # Compute the weighted adjacency matrix based on the type of similarity graphs
    if sim_graph == 'fully_connect':
        W = np.exp(-np.square(Adj_mat) / (2 * sigma)**2)
    elif sim_graph == 'eps_neighbor':
        W = (Adj_mat <= epsi).astype('float64')
    elif sim_graph == 'knn':
        W = np.zeros(Adj_mat.shape)
        # Sort the adjacency matrix by rows and record the indices
        Adj_sort = np.argsort(Adj_mat, axis=1)
        # Set W[i, j] to 1 when j is among the (knn + 1) closest points of i, which usually include i itself
        # (only row i is written, so W is not symmetrized here)
        for i in range(Adj_sort.shape[0]):
            W[i, Adj_sort[i, :][:(knn + 1)]] = 1
    elif sim_graph == 'mutual_knn':
        W1 = np.zeros(Adj_mat.shape)
        # Sort the adjacency matrix by rows and record the indices
        Adj_sort = np.argsort(Adj_mat, axis=1)
        # Set the weight W1[i,j] to 0.5 when either i or j is within the k-nearest neighbors of each other (Flag)
        # Set the weight W1[i,j] to 1 when both i and j are within the k-nearest neighbors of each other
        for i in range(Adj_mat.shape[0]):
            for j in Adj_sort[i, :][:(knn + 1)]:
                if i == j:
                    W1[i, i] = 1
                elif W1[i, j] == 0 and W1[j, i] == 0:
                    W1[i, j] = 0.5
                else:
                    W1[i, j] = W1[j, i] = 1
        W = np.copy((W1 > 0.5).astype('float64'))
    else:
        raise ValueError(
            "The 'sim_graph' argument should be one of the strings, 'fully_connect', 'eps_neighbor', 'knn', or 'mutual_knn'!")

    # Compute the degree matrix and the unnormalized graph Laplacian
    D = np.diag(np.sum(W, axis=1))
    L = D - W

    # Compute the matrix with the first K eigenvectors as columns based on the normalized type of L
    if normalized == 1:  ## Random Walk normalized version
        # Compute the inverse of the diagonal matrix
        D_inv = np.diag(1 / np.diag(D))
        # Compute the eigenpairs of L_{rw}
        Lambdas, V = np.linalg.eig(np.dot(D_inv, L))
        # Sort the eigenvalues by their L2 norms and record the indices
        ind = np.argsort(np.linalg.norm(np.reshape(Lambdas, (1, len(Lambdas))), axis=0))
        V_K = np.real(V[:, ind[:K]])
    elif normalized == 2:  ## Graph cut normalized version
        # Compute the square root of the inverse of the diagonal matrix
        D_inv_sqrt = np.diag(1 / np.sqrt(np.diag(D)))
        # Compute the eigenpairs of L_{sym}
        Lambdas, V = np.linalg.eig(np.matmul(np.matmul(D_inv_sqrt, L), D_inv_sqrt))
        # Sort the eigenvalues by their L2 norms and record the indices
        ind = np.argsort(np.linalg.norm(np.reshape(Lambdas, (1, len(Lambdas))), axis=0))
        V_K = np.real(V[:, ind[:K]])
        if any(V_K.sum(axis=1) == 0):
            raise ValueError(
                "Can't normalize the matrix with the first K eigenvectors as columns! Perhaps the number of clusters K or the number of neighbors in k-NN is too small.")
        # Normalize the row sums to have norm 1
        V_K = V_K / np.reshape(np.linalg.norm(V_K, axis=1), (V_K.shape[0], 1))
    else:  ## Unnormalized version
        # Compute the eigenpairs of L
        Lambdas, V = np.linalg.eig(L)
        # Sort the eigenvalues by their L2 norms and record the indices
        ind = np.argsort(np.linalg.norm(np.reshape(Lambdas, (1, len(Lambdas))), axis=0))
        V_K = np.real(V[:, ind[:K]])

    # Conduct K-Means on the matrix with the first K eigenvectors as columns
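    # n_init is set explicitly to the former default of 10, so that newer scikit-learn versions
    # do not emit a FutureWarning (which would be raised as an error further below)
    kmeans = KMeans(n_clusters=K, init='k-means++', n_init=10, random_state=0).fit(V_K)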
    return kmeans
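
For symmetric weight matrices, $L$ and $L_{sym}$ are symmetric, so `np.linalg.eigh` could be used instead of `np.linalg.eig`. It returns real eigenvalues in ascending order, which makes the sorting by absolute value and the `np.real` call unnecessary. The sketch below is an alternative under that symmetry assumption, not the implementation used for the results in this notebook; `first_k_eigenvectors` is a hypothetical helper. The weight matrix that the function above builds for `sim_graph='knn'` is not symmetric in general and would have to be symmetrized first.

```python
import numpy as np


def first_k_eigenvectors(W, K):
    """Eigenpairs of the unnormalized Laplacian for the K smallest eigenvalues."""
    if not np.allclose(W, W.T):
        raise ValueError("W must be symmetric to use np.linalg.eigh")
    L = np.diag(W.sum(axis=1)) - W
    eigenvalues, eigenvectors = np.linalg.eigh(L)  # ascending, real-valued
    return eigenvalues[:K], eigenvectors[:, :K]


# Two disconnected pairs of vertices: the eigenvalue 0 appears twice
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
eigenvalues, V_K = first_k_eigenvectors(W, K=2)
print(np.round(eigenvalues, 10))
print(V_K.shape)
```

The multiplicity of the eigenvalue 0 equals the number of connected components, which is why the first $K$ eigenvectors carry the cluster structure.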

# Applying spectral clustering to all datasets

In [None]:
import time
import warnings
from sklearn.metrics import adjusted_rand_score, silhouette_score, calinski_harabasz_score, davies_bouldin_score

warnings.filterwarnings("error")
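
With the filter set to `"error"`, every warning is raised as an exception, so numerical problems such as a division by a zero degree can be caught with `try`/`except`. A minimal sketch that uses a local `warnings.catch_warnings()` context, so the global filter is not touched; the degree values are made up:

```python
import warnings

import numpy as np

degrees = np.array([2.0, 0.0, 1.0])  # the second vertex is isolated

with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        inverse_degrees = 1 / degrees
        outcome = "ok"
    except RuntimeWarning as warning:
        outcome = f"caught: {warning}"

print(outcome)
```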

For the self-generated datasets, the ground-truth classes are known. I therefore evaluate the clustering with the adjusted Rand index (ARI). 
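
The ARI compares two partitions independently of how the clusters are numbered, which matters here because k-means assigns arbitrary label ids. A small illustration with made-up labels:

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 0, 1, 1, 1]
same_partition_other_ids = [1, 1, 1, 0, 0, 0]
one_point_wrong = [0, 0, 1, 1, 1, 1]

ari_perfect = adjusted_rand_score(truth, same_partition_other_ids)
ari_partial = adjusted_rand_score(truth, one_point_wrong)
print(ari_perfect)  # 1.0: identical partitions despite swapped ids
print(ari_partial)  # strictly between 0 and 1
```

The score is symmetric in its two arguments, so the order in which the predicted and the true labels are passed does not change the result.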

In [None]:
def evaluate_single_dataset(n):
  # silhouette_scores = []
  # calinski_h_scores = []
  # davies_b_scores = []
  ari = []

  has_exception = []
  iter_with_LinAlgError = []
  iter_with_ValueError = []
  iter_with_RuntimeWarning = []
  iter_with_Errors_Or_Warnings = []
  iter_without_Errors = []
  # eigenvalues_of_one_dataset = []
  
  dataset_index = []
  metric = []
  simgraph = []
  sigma = []
  knn = []
  epsi = []
  normalized = []

  for i in range(len(grid)):
    try:
      clust = Spectral_Clustering(X=generated_datasets[n], **grid[i], adj=False)
    except np.linalg.LinAlgError:
      iter_with_LinAlgError.append(i)
      has_exception.append(1)
      continue
    except ValueError as e:
      # print(e)
      iter_with_ValueError.append(i)
      has_exception.append(1)
      continue
    except RuntimeWarning as e: 
      # print(e)
      iter_with_RuntimeWarning.append(i)
      has_exception.append(1)
      continue
    except Exception:
      iter_with_Errors_Or_Warnings.append(i)
      has_exception.append(1)
      continue
    else:
      dataset_index.append(n)
      has_exception.append(0)
      metric.append(grid[i]['metric'])
      simgraph.append(grid[i]['sim_graph'])
      if 'knn' in grid[i]:
        knn.append(grid[i]['knn'])
        sigma.append(-1)
        epsi.append(-1)
      if 'sigma' in grid[i]:
        sigma.append(grid[i]['sigma'])
        knn.append(-1)
        epsi.append(-1)
      if 'epsi' in grid[i]:
        epsi.append(grid[i]['epsi'])
        knn.append(-1)
        sigma.append(-1)
      normalized.append(grid[i]['normalized'])
      ari.append(adjusted_rand_score(clust.labels_, ground_truth[n]))

      iter_without_Errors.append(i)
      #score1 = silhouette_score(generated_datasets[k], clust.labels_)
      #score2 = calinski_harabasz_score(generated_datasets[k], clust.labels_)
      # score3 = davies_bouldin_score(generated_datasets[k], clust.labels_)
      # silhouette_scores.append(score1)
      # calinski_h_scores.append(score2)
      # davies_b_scores.append(score3)
      # eigenvalues_of_one_dataset.append(eigenvalues)

      # print(f'\t Iteration: {i}')
      # print(f'silhouette:         {score1}')
      # print(f'calinski_harabasz:  {score2}')
      # print(f'davies_bouldin: 
      
  return (# silhouette_scores, 
          # calinski_h_scores, 
          # davies_b_scores, 

          metric,
          simgraph,
          sigma,
          knn,
          epsi, 
          normalized,
          
          iter_with_LinAlgError, 
          iter_with_ValueError, 
          iter_with_RuntimeWarning, 
          iter_with_Errors_Or_Warnings, 
          iter_without_Errors, 
          # eigenvalues_of_one_dataset, 
          dataset_index,
          ari)

In [None]:
datasets_results = {
    # 'silhouette_score': list(),
    # 'calinski_h_scores': list(),
    # 'davies_b_scores': list(),
    'ari': list(),
    
    'metric': list(),
    'sim_graph': list(),
    'sigma': list(),
    'knn': list(),
    'epsi': list(),
    'normalized': list(),

    'LinAlgError': list(),
    'ValueError': list(),
    'RuntimeWarning': list(),
    'Errors_Or_Warnings': list(),
    'no_Errors': list(),
    'eigenvalues': list(),
    'dataset_index': list(),
}


startTime = time.time()

for dataset_id in range(3):
  print(f'DATASET {dataset_id}')
  (# silhouette_scores, 
   # calinski_h_scores, 
   # davies_b_scores, 
   metric, 
   simgraph,
   sigma, 
   knn, 
   epsi, 
   normalized, 
   iter_with_LinAlgError, 
   iter_with_ValueError, 
   iter_with_RuntimeWarning, 
   iter_with_Errors_Or_Warnings, 
   iter_without_Errors, 
   # eigenvalues_of_one_dataset, 
   dataset_index, 
   ari) = evaluate_single_dataset(dataset_id)
  # datasets_results['silhouette_score'].append(silhouette_scores)
  # datasets_results['calinski_h_scores'].append(calinski_h_scores)
  # datasets_results['davies_b_scores'].append(davies_b_scores)
  
  datasets_results['metric'].append(metric)
  datasets_results['sim_graph'].append(simgraph)
  datasets_results['sigma'].append(sigma)
  datasets_results['knn'].append(knn)
  datasets_results['epsi'].append(epsi)
  datasets_results['normalized'].append(normalized)
  
  datasets_results['LinAlgError'].append(iter_with_LinAlgError)
  datasets_results['ValueError'].append(iter_with_ValueError)
  datasets_results['RuntimeWarning'].append(iter_with_RuntimeWarning)
  datasets_results['Errors_Or_Warnings'].append(iter_with_Errors_Or_Warnings)
  datasets_results['no_Errors'].append(iter_without_Errors)
  # datasets_results['eigenvalues'].append(eigenvalues_of_one_dataset)
  datasets_results['dataset_index'].append(dataset_index)
  datasets_results['ari'].append(ari)

executionTime = (time.time() - startTime)
print('Execution time in seconds: ' + str(executionTime))

# Dataset for evaluation

In [None]:
import pandas as pd

In [None]:
def build_result_frame(i):
  # Collect the successful grid iterations of dataset i in one DataFrame
  return pd.DataFrame(data={
        'Dataset': datasets_results['dataset_index'][i],
        'Iteration': datasets_results['no_Errors'][i],
        'metric': datasets_results['metric'][i],
        'sim_graph': datasets_results['sim_graph'][i],
        'sigma': datasets_results['sigma'][i],
        'knn': datasets_results['knn'][i],
        # epsi entries may be length-1 arrays, so flatten them into one column
        'epsi': np.hstack(datasets_results['epsi'][i]),
        'normalised': datasets_results['normalized'][i],
        # 'Calinski': np.hstack(datasets_results['calinski_h_scores']),
        # 'Davis': np.hstack(datasets_results['davies_b_scores']),
        # 'Silhouette': np.hstack(datasets_results['silhouette_score']),
        'ari': datasets_results['ari'][i],
        })

df_final = pd.concat([build_result_frame(i) for i in range(3)], ignore_index=True, sort=False)

In [None]:
# save Dataframe as csv in Google Drive
from google.colab import drive
# drive.mount('drive')
# df_final.to_csv('Experiment_1_Test.csv')
# Write the DataFrame to CSV file.
with open('/content/drive/My Drive/Colab Notebooks/Bachelorarbeit/Experiment1-1_Test2.csv', 'w') as f:
  df_final.to_csv(f)
# !cp data.csv "drive/folders/'Meine Ablage'/'Colab Notebooks'/Bachelorarbeit"

## Importing the dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# from google.colab import drive
# drive.mount('drive')
# df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Bachelorarbeit/Experiment1-1_Test2.csv')
df = pd.read_csv('Experiment1-1.csv')

In [None]:
# Check the dataset for NaN values
display(df.info())
df.head()

## Examining the ARI values

In [None]:
# Examine the scale of the ARI values
display(pd.unique(df.ari))
plt.figure(figsize=(10, 5))
plt.hist(df.ari, bins=len(pd.unique(df.ari)))
plt.show()

In [None]:
df.groupby('Dataset').ari.describe()

## Plotting the ARI for every dataset and iteration

In [None]:
# One panel per dataset; suptitle sets the title of the whole figure
fig, axes = plt.subplots(3, 1, figsize=(20.0, 10.0))
fig.suptitle('Adjusted Rand Index', fontsize=14)

for dataset_id, ax in enumerate(axes):
    aris = df[df.Dataset == dataset_id].ari
    ax.set_title(f'Dataset {dataset_id + 1}')
    # Blue markers for the individual iterations, black line connecting them
    ax.plot(aris, 'bo', aris, 'k')
    ax.set_xticks(np.arange(int(aris.index[0]), int(aris.index[-1]), 20))

plt.show()

## *1.* Influence of the metric on cluster quality

I examine which metric achieves the higher cluster quality. To do so, I look at descriptive statistics of the ARI, first for all datasets together and then for each dataset individually.

In [None]:
df.groupby('metric').ari.describe().round(4)

In [None]:
df.groupby([ 'Dataset', 'metric']).ari.describe().round(4)

## *2.* Influence of the similarity graph on cluster quality

In [None]:
df.groupby('sim_graph').ari.describe()

In [None]:
df.groupby([ 'Dataset', 'sim_graph']).ari.describe().round(4)

In [None]:
df[df.sim_graph == 'fully_connect'].groupby([ 'Dataset', 'metric']).ari.describe().round(4)

## *3.* Influence of $\sigma$ on cluster quality
The influence is to be quantified with a correlation coefficient. To judge which correlation measure to use and to get an impression of the direction of the correlation, I plot sigma against cluster quality in a scatter plot.
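
As a reminder of what the choice of correlation measure changes: Pearson's coefficient measures linear association, whereas Spearman's coefficient is computed on ranks and therefore reaches 1 for any strictly increasing relationship. A minimal sketch on synthetic values (not the experiment's data):

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(0.01, 1.5, 30))
y = np.exp(5 * x)  # strictly increasing but strongly non-linear

pearson = x.corr(y, method='pearson')
spearman = x.corr(y, method='spearman')
print(round(pearson, 3), round(spearman, 3))
```

If the scatter plot shows a monotone but non-linear pattern, a rank-based measure such as Spearman's is the more suitable choice.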


In [None]:
import seaborn as sns

In [None]:
plt.figure(figsize=(5, 5))
sns.scatterplot(data=df[df.sigma != -1], x='sigma', y='ari')
plt.show()

In [None]:
df[df.sigma != -1].sigma.corr(df[df.sigma != -1].ari, method='spearman')

In [None]:
# What are the most frequent values?
# Rows without a sigma value carry the placeholder -1 and are excluded
ari_round_unique = np.unique(np.round(df[df.sigma != -1].ari,2), return_counts=True)
# sort the rounded ARI values by frequency, most frequent first
ari_round_unique_sorted = np.array([x for _,x in sorted(zip(ari_round_unique[1],ari_round_unique[0]), reverse=True)])
ari_round_unique_sorted

In [None]:
df2 = df[df.sigma != -1].copy()
df2.ari = np.round(df2.ari, 2)
# create a copy of df2 restricted to the 4 most frequent rounded ARI values in ari_round_unique_sorted
df3 = df2[df2.ari.isin(ari_round_unique_sorted[:4])].copy()
# df3.groupby(['ari', 'Dataset','normalised', 'metric']).ari.describe()
# df3.groupby(['ari', 'metric']).sigma.describe()
# df3.groupby(['sigma']).ari.describe()
df3.groupby(['Dataset', 'ari']).sigma.describe().round(4)

In [None]:
df[df.Dataset == 0].shape

## *4.1* Influence of $k$ in the kNN graph on cluster quality

In [None]:
# Scatter plot of k against cluster quality for the rows whose similarity graph is knn
plt.figure(figsize=(5, 5))
sns.scatterplot(data=df[df.sim_graph == 'knn'], x='knn', y='ari')
plt.show()

## *4.2* Influence of $k$ in the mutual kNN graph on cluster quality
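
For orientation, the textbook difference between the two graph types can be sketched as follows: the kNN graph connects $i$ and $j$ if at least one of them is among the $k$ nearest neighbors of the other, while the mutual kNN graph requires both directions. The function `knn_graphs` is a hypothetical helper and not the implementation used in this notebook; unlike the function above, it excludes self-loops and assumes that there are no duplicate points.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def knn_graphs(X, k):
    """Return the (symmetric) kNN and mutual kNN adjacency matrices of X."""
    dist = squareform(pdist(X))
    n = dist.shape[0]
    # Column 0 of the sorted indices is the point itself, so skip it
    neighbors = np.argsort(dist, axis=1)[:, 1:k + 1]
    directed = np.zeros((n, n), dtype=bool)
    directed[np.repeat(np.arange(n), k), neighbors.ravel()] = True
    knn = directed | directed.T      # edge if one direction exists
    mutual = directed & directed.T   # edge only if both directions exist
    return knn.astype(float), mutual.astype(float)


# Toy data on a line: the gaps between neighboring points grow
X = np.array([[0.0], [1.0], [3.0], [7.0]])
knn, mutual = knn_graphs(X, k=1)
print(int(knn.sum()) // 2, int(mutual.sum()) // 2)  # number of undirected edges
```

In this toy example the kNN graph is a connected chain, whereas the mutual kNN graph keeps only the edge between the two closest points. The mutual variant is therefore sparser and more likely to leave isolated vertices for small $k$.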

In [None]:
plt.figure(figsize=(5, 5))
sns.scatterplot(data=df[df.sim_graph == 'mutual_knn'], x='knn', y='ari')
plt.show()

In [None]:
# How many iterations were run in total with the mutual kNN graph?
df[df.sim_graph == 'mutual_knn'] 
# df[(df.sim_graph == 'mutual_knn') & (df.ari != 1.0) & (df.ari != 0.0)]

In [None]:
df[(df.normalised == 2) & (df.sim_graph == 'mutual_knn')]

## *5.* Influence of $\epsilon$ on cluster quality

In [None]:
plt.figure(figsize=(10, 5))
sns.scatterplot(data=df[df.sim_graph == 'eps_neighbor'], x='epsi', y='ari')
plt.show()

In [None]:
df[df.sim_graph == 'eps_neighbor']

Beyond a certain epsi value, the cluster quality declines. I filter out the epsi values for which the ARI is at least 0.8.

In [None]:
# epsi values with ari >= 0.8, followed by epsi values with ari < 0.8
epsi_ari_better_model = df[(df.ari >= 0.8) & (df.epsi != -1)].epsi.unique()
display(pd.DataFrame(epsi_ari_better_model).describe())
epsi_ari_worse_model = df[(df.ari < 0.8) & (df.epsi != -1)].epsi.unique()
display(pd.DataFrame(epsi_ari_worse_model).describe())

In [None]:
df[df.epsi != -1].epsi.corr(df[df.epsi != -1].ari, method='spearman')

## *6.* Influence of the graph Laplacian on cluster quality
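
The `normalized` parameter selects one of three graph Laplacians: the random-walk Laplacian $L_{rw} = D^{-1}L$ (value 1), the symmetric Laplacian $L_{sym} = D^{-1/2} L D^{-1/2}$ (value 2), and the unnormalized Laplacian $L = D - W$ (any other value, here 3). $L_{rw}$ and $L_{sym}$ are similar matrices and share their eigenvalues; only their eigenvectors differ. A minimal sketch on a toy weight matrix with made-up values:

```python
import numpy as np

W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.2],
              [0.5, 0.2, 0.0]])
degrees = W.sum(axis=1)

L = np.diag(degrees) - W
L_rw = np.diag(1 / degrees) @ L
L_sym = np.diag(1 / np.sqrt(degrees)) @ L @ np.diag(1 / np.sqrt(degrees))

eig_rw = np.sort(np.linalg.eigvals(L_rw).real)
eig_sym = np.linalg.eigvalsh(L_sym)  # ascending order
print(np.allclose(eig_rw, eig_sym))
```

Differences in cluster quality between the two normalized variants therefore stem from the eigenvectors and from the additional row normalization applied in the symmetric case, not from the spectrum.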

In [None]:
df.groupby('normalised').ari.describe().round(4)

In [None]:
df.groupby(['Dataset', 'normalised']).ari.describe().round(4)

In [None]:
# The width was missing in the original cell; 10 is an assumed value
plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x='normalised', y='ari', hue='Dataset')
plt.show()

# Summary

In [None]:
a = df.groupby([ 'Dataset', 'metric']).ari.describe().round(4)
b = df.groupby([ 'Dataset', 'sim_graph']).ari.describe().round(4)
c = df[df.sim_graph == 'fully_connect'].groupby([ 'Dataset', 'metric']).ari.describe().round(4)
c.index = pd.MultiIndex.from_tuples([(x[0], "fully_" + x[1]) for x in c.index])
d = df.groupby(['Dataset', 'normalised']).ari.describe().round(4)

frames = [a,b,c,d]
summary = pd.concat(frames)
summary = summary.rename({"metric":"parameter"})

In [None]:
summary.sort_values(by=['mean','std'], ascending=[False, True])[['count', 'mean', 'std']]

| Dataset | metric | count | mean | std |
|---|---|---|---|---|
| 0 | fully_euclidean | 90.0 | 1.0 | 0.0 |
| 0 | fully_connect | 180.0 | 0.9607 | 0.0394 |
| 0 | fully_correlation | 90.0 | 0.9213 | 0.0 |
| 0 | knn | 102.0 | 0.908 | 0.1985 |
| 0 | correlation | 182.0 | 0.8201 | 0.2888 |
| 0 | 2 | 145.0 | 0.7872 | 0.3577 |
| 0 | 1 | 149.0 | 0.7745 | 0.3758 |
| 0 | euclidean | 278.0 | 0.6965 | 0.446 |
| 0 | 3 | 166.0 | 0.6828 | 0.4371 |
| 2 | knn | 102.0 | 0.6261 | 0.322 |


In [None]:
df.groupby(['Dataset', 'sim_graph', 'metric', 'normalised']).ari.describe().round(4).sort_values(by=['Dataset','mean','std'], ascending=[True, False, True])[['count','mean', 'std']]

| Dataset | sim_graph | metric | normalised | count | mean | std |
|---|---|---|---|---|---|---|
| 0 | fully_connect | euclidean | 1 | 30.0 | 1.0000 | 0.0000 |
| 0 | fully_connect | euclidean | 2 | 30.0 | 1.0000 | 0.0000 |
| 0 | fully_connect | euclidean | 3 | 30.0 | 1.0000 | 0.0000 |
| 0 | knn | euclidean | 1 | 17.0 | 0.9412 | 0.2425 |
| 0 | knn | euclidean | 3 | 17.0 | 0.9238 | 0.2205 |
| ... | ... | ... | ... | ... | ... | ... |
| 2 | fully_connect | euclidean | 3 | 30.0 | 0.3406 | 0.2620 |
| 2 | eps_neighbor | correlation | 2 | 8.0 | 0.3279 | 0.1308 |
| 2 | eps_neighbor | correlation | 1 | 8.0 | 0.3273 | 0.1323 |
| 2 | eps_neighbor | correlation | 3 | 8.0 | 0.3272 | 0.1326 |
