<h1>
<center>Clustering </center>
</h1>

## General Setup

<font size="3"> 
Package imports and system configuration.
</font>

In [1]:
from keras.datasets import mnist
import numpy as np
from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.metrics import silhouette_score
from sklearn.cluster import SpectralClustering, KMeans

## Data Loading & Preprocessing 

<font size="3">  
A function that prints the data shapes, flattens each image into a vector, and casts the values to 32-bit floats so that the data is suitable for our models.
</font>

In [2]:
def data_reshape(x_train,y_train,x_test,y_test):
    print ('Basic informations:')
    print('X_train: ' + str(x_train.shape))
    print('Y_train: ' + str(y_train.shape))
    print('X_test:  ' + str(x_test.shape))
    print('Y_test:  ' + str(y_test.shape))
    x_train = x_train.reshape(x_train.shape[0], np.prod(x_train.shape[1:])) 
    x_test = x_test.reshape(x_test.shape[0], np.prod(x_test.shape[1:]))  
    # Change integers to 32-bit floating point numbers
    x_train = x_train.astype('float32')   
    x_test = x_test.astype('float32')
    print("\nData shapes after reshaping:")
    print("Training matrix shape", x_train.shape)
    print("Testing matrix shape", x_test.shape)
    return x_train,y_train,x_test,y_test
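The flattening step can be checked in isolation. The following is a minimal sketch on a hypothetical batch of four blank 28x28 images (the array `images` is a stand-in, not MNIST data); it uses the same `reshape` and `astype` calls as the function above.

```python
import numpy as np

# Hypothetical stand-in for a batch of four 28x28 grayscale images
images = np.zeros((4, 28, 28), dtype=np.uint8)

# Keep the sample axis and merge all remaining axes into one feature axis
flat = images.reshape(images.shape[0], np.prod(images.shape[1:]))

# Cast the integer pixel values to 32-bit floats
flat = flat.astype('float32')

print(flat.shape)  # (4, 784)
print(flat.dtype)  # float32
```

Each image becomes one row of 28 * 28 = 784 features, which is the two-dimensional layout that scikit-learn estimators expect.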

<font size="3">
A function that provides us with the input data:
<ol>
<li>Load the MNIST data.</li>
<li>Take a subset of the training and test data according to the given sizes (if the subset argument is True).</li>
<li>Pass the data to the function above and return the reshaped data.</li>
</ol>
</font>

In [3]:
def data_load(subset,train_subset_size,test_subset_size):
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    if subset:
        # Keep only the first train_subset_size / test_subset_size samples
        x_train,y_train = x_train[:train_subset_size],y_train[:train_subset_size]
        x_test,y_test = x_test[:test_subset_size],y_test[:test_subset_size]
    # Flatten the images and cast them to float32
    return data_reshape(x_train,y_train,x_test,y_test)

<font size="3">  
A function that standardizes the given data (zero mean and unit variance per feature) and returns it.
</font>

In [4]:
def scalling(x_train,x_test):
    # Fit the scaler on the training data only and reuse its statistics for the test data
    scaler = StandardScaler()
    x_train = scaler.fit_transform(x_train)
    x_test = scaler.transform(x_test)
    return x_train,x_test
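A common convention is to estimate the mean and variance on the training data only and to reuse those statistics for the test data. The sketch below illustrates this on synthetic data; the arrays `train` and `test` are hypothetical stand-ins. The last column is constant, similar to MNIST border pixels that are always zero, and `StandardScaler` leaves such a column at zero instead of dividing by zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: three random features plus one constant (all-zero) feature
train = np.hstack([rng.normal(5.0, 2.0, size=(200, 3)), np.zeros((200, 1))])
test = np.hstack([rng.normal(5.0, 2.0, size=(50, 3)), np.zeros((50, 1))])

# Estimate mean and variance on the training data only
scaler = StandardScaler().fit(train)

# Apply the same statistics to both sets
train_std = scaler.transform(train)
test_std = scaler.transform(test)

print(train_std[:, :3].mean(axis=0))  # approximately 0 for each random feature
print(train_std[:, :3].std(axis=0))   # approximately 1 for each random feature
print(np.isfinite(train_std).all())   # the constant column does not produce NaN
```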

<font size="3">  
A function that applies clustering with the given method and the given k and returns the silhouette score.
</font>

In [5]:
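def cluster_pipeline(data_input,method,k):
    if method == 'K-Means':
        cluster_alg = KMeans(n_clusters=k)
    elif method == 'Spectral-Clustering':
        cluster_alg = SpectralClustering(n_clusters=k, affinity='nearest_neighbors', random_state=0)
    else:
        # Fail early instead of hitting an undefined cluster_alg below
        raise ValueError('Unknown clustering method: ' + str(method))
    
    clustering = cluster_alg.fit(data_input)
    cluster_assignments = clustering.labels_
    # Mean silhouette coefficient over all samples, in the range [-1, 1]
    silhouette = silhouette_score(data_input,cluster_assignments)  
    return silhouette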
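For reference, the silhouette coefficient of a sample is (b - a) / max(a, b), where a is its mean distance to the other points of its own cluster and b is its mean distance to the points of the nearest other cluster. The score returned above is the mean over all samples: values close to 1 indicate compact, well-separated clusters, and values close to 0 indicate overlapping ones. The sketch below shows the well-separated case on synthetic blobs; the centers and the `cluster_std` value are arbitrary choices made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight, far-apart blobs (hypothetical toy data, not MNIST)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Cluster with the true number of groups and score the assignment
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)

print(round(score, 3))
```

On data like this the score should be close to 1, which gives a point of comparison for scores obtained on raw pixel data.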

<font size="3">  
A function that runs one experiment for every given value of k using a for loop and collects the results.
</font>

In [6]:
def k_experiments(data_input,method,n_clusters):
    all_results = []
    for k in n_clusters:
        silhouette = cluster_pipeline(data_input,method,k)
        # One row per experiment: [method, k, silhouette score]
        experiment = [method,k,silhouette]
        print ('Algorithm: ',experiment[0],', K:',experiment[1],', Silhouette Score:',experiment[2])
        all_results.append(experiment)
    return all_results
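The returned list of `[method, k, silhouette]` rows can be used to pick the k with the highest score. The sketch below is one possible way to do this; the three rows are copied from the K-Means output reported further down in this notebook, and the names `results` and `best` are introduced here for illustration only.

```python
# Rows in the same [method, k, silhouette] layout that k_experiments returns;
# the values are a few of the K-Means scores printed later in this notebook
results = [
    ['K-Means', 6, 0.0140361],
    ['K-Means', 9, 0.027330102],
    ['K-Means', 13, -0.010654097],
]

# Select the row with the highest silhouette score (third entry of each row)
best = max(results, key=lambda row: row[2])

print('Best k:', best[1], 'with silhouette score', best[2])
```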

## MNIST Dataset

<font size="3">
In the following cells we use the functions above to apply the clustering algorithms to a subset of the MNIST dataset.
</font>

### Define Variables

In [7]:
train_subset_size = 5000
test_subset_size = 2000
different_k = np.arange(6,16,1)

### Data Loading & Preprocessing

In [8]:
x_train,y_train,x_test,y_test  = data_load(True,train_subset_size,test_subset_size)
x_train,x_test = scalling(x_train,x_test)

Basic informations:
X_train: (5000, 28, 28)
Y_train: (5000,)
X_test:  (2000, 28, 28)
Y_test:  (2000,)

Data shapes after reshaping:
Training matrix shape (5000, 784)
Testing matrix shape (2000, 784)


### Experiments

In [9]:
all_results = k_experiments(x_train,'K-Means',different_k)
print ('\n')
# Append instead of overwriting, so the K-Means results are kept as well
all_results += k_experiments(x_train,'Spectral-Clustering',different_k)

Algorithm:  K-Means , K: 6 , Silhouette Score: 0.0140361
Algorithm:  K-Means , K: 7 , Silhouette Score: 0.011110065
Algorithm:  K-Means , K: 8 , Silhouette Score: 0.008026962
Algorithm:  K-Means , K: 9 , Silhouette Score: 0.027330102
Algorithm:  K-Means , K: 10 , Silhouette Score: 0.014517139
Algorithm:  K-Means , K: 11 , Silhouette Score: 0.012568094
Algorithm:  K-Means , K: 12 , Silhouette Score: 0.01579247
Algorithm:  K-Means , K: 13 , Silhouette Score: -0.010654097
Algorithm:  K-Means , K: 14 , Silhouette Score: -0.005366548
Algorithm:  K-Means , K: 15 , Silhouette Score: 0.003461831


Algorithm:  Spectral-Clustering , K: 6 , Silhouette Score: 0.057583276
Algorithm:  Spectral-Clustering , K: 7 , Silhouette Score: -0.07768214
Algorithm:  Spectral-Clustering , K: 8 , Silhouette Score: -0.0713104
Algorithm:  Spectral-Clustering , K: 9 , Silhouette Score: -0.06938579
Algorithm:  Spectral-Clustering , K: 10 , Silhouette Score: -0.06417128
Algorithm:  Spectral-Clustering , K: 11 , Silhou