# *Computing DBSCAN clustering on similarity matrices with parameter combinations*

**Author: Lucía Prieto Santamaría** (lucia.prieto.santamaria@alumnos.upm.es)

This notebook runs a complete DBSCAN clustering analysis on similarity matrices stored in CSV files. It explores how the *Silhouette* coefficient varies across different values of the two DBSCAN parameters, Eps (epsilon) and MinPts (ms).
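
As a minimal sketch of the idea, the following example clusters a small, made-up similarity matrix (six items forming two tight groups; the values are illustrative and not taken from the disease data). It converts similarity to distance as `1 - similarity`, runs DBSCAN with `metric='precomputed'`, and scores the result with the Silhouette coefficient.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics

# Hypothetical symmetric similarity matrix: items 0-2 and items 3-5 form two groups
similarity = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.0, 0.1],
    [0.9, 1.0, 0.9, 0.0, 0.1, 0.0],
    [0.8, 0.9, 1.0, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.1, 1.0, 0.9, 0.8],
    [0.0, 0.1, 0.0, 0.9, 1.0, 0.9],
    [0.1, 0.0, 0.1, 0.8, 0.9, 1.0],
])

# DBSCAN works on distances, so similarity is turned into a distance
distance = 1 - similarity

# eps and min_samples are illustrative values chosen for this toy matrix
labels = DBSCAN(eps=0.3, min_samples=2, metric='precomputed').fit_predict(distance)
score = metrics.silhouette_score(distance, labels, metric='precomputed')

print(labels)
print(score)
```

With these toy values the two groups are recovered as two clusters and no item is labeled as noise (`-1`).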

In [1]:
# Import the required libraries
import csv # Module to read the data from the similarity matrices stored in csv files
import numpy as np # Library used to structure the data before running the algorithm
from sklearn.cluster import DBSCAN # scikit-learn implementation of the DBSCAN algorithm
from sklearn import metrics # scikit-learn module used to compute the silhouette coefficient

In [2]:
# VARIABLES DECLARATION

# We need to specify here the total number of diseases in the subset we are working with
number_of_diseases = 3671

# Folder and group name used to build the input and output file paths
# IMPORTANT: the folder must be created by the user beforehand
directory_name = 'weighted_score/wesco'

# Similarity metrics whose matrices will be clustered
sim_metrics = (
   'S_gen_cos',
   'S_gen_jaccard',
   'S_gen_dice',
   'S_prot_cos',
   'S_prot_jaccard',
   'S_prot_dice',
   'S_path_cos',
   'S_path_jaccard',
   'S_path_dice',
   'S_ppi_cos',
   'S_ppi_jaccard',
   'S_ppi_dice',
   'S_term_cos',
   'S_term_jaccard',
   'S_term_dice'
   )


# Eps (epsilon) is one of the two parameters that DBSCAN requires.
# We explore the following values:
epsilons = (
        0.3,
        0.4,
        0.5,
        0.6,
        0.65,
        0.7,
        0.75,
        0.8,
        0.85,
        0.9,
        0.95,
        0.99
        )

# MinPts (ms) is the other DBSCAN parameter. We explore the following values:
ms_values = (
            2, 
            3, 
            5, 
            30)

# Dictionaries keyed by the combination (Eps, MinPts, similarity metric).
n_clusters = {} # Maps each combination to the number of clusters it generates
silh_coef = {} # Maps each combination to its Silhouette coefficient
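
The parameter grid declared above amounts to 12 Eps values x 4 MinPts values x 15 metrics = 720 combinations. As a small sketch of how such keys can be enumerated, the example below uses `itertools.product` on a shortened copy of the tuples; the `results` dictionary and its `None` placeholder are illustrative and not part of the notebook.

```python
from itertools import product

# Shortened versions of the notebook's tuples, for illustration only
epsilons = (0.3, 0.4)
ms_values = (2, 3)
sim_metrics = ('S_gen_cos', 'S_gen_jaccard')

# Hypothetical results dictionary, keyed the same way as n_clusters and silh_coef
results = {}
for metric, ms, epsilon in product(sim_metrics, ms_values, epsilons):
    results[epsilon, ms, metric] = None  # Placeholder for a computed value

print(len(results))
```

Writing `results[epsilon, ms, metric]` is the same as indexing with the tuple `(epsilon, ms, metric)`, which is why the notebook can later look values up by any combination of the three.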

In [3]:
# Getting the list of diseases we are going to work with

file_name = 'excels/' + directory_name + str(sim_metrics[0]) + '.csv'

with open(file_name) as f:
    reader = csv.reader(f, delimiter = ",")
    diseases = next(reader)

diseases.pop(0) # Remove the first header cell (the row-label column), leaving only disease names
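
The header handling can be tried without the real files. The sketch below assumes the layout implied by the cell above: a header row whose first cell belongs to the row-label column, followed by one row per disease. The in-memory sample and the names `D1` and `D2` are made up for illustration.

```python
import csv
import io

# Hypothetical two-disease similarity matrix in the assumed CSV layout
sample = io.StringIO(",D1,D2\nD1,1.0,0.2\nD2,0.2,1.0\n")

reader = csv.reader(sample, delimiter=",")
header = next(reader)   # First row: corner cell followed by disease names
diseases = header[1:]   # Skip the corner cell instead of popping it

print(diseases)
```

Because the file is opened with a `with` statement in the notebook, it is closed automatically when the block ends, so no explicit `close()` call is needed.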

In [4]:
# PERFORMING CLUSTERING ON SIMILARITY MATRICES

columns = tuple(range(1, number_of_diseases + 1)) # Columns to read from the similarity matrix CSV file (column 0 is skipped)


for metric in sim_metrics:
    
    distance = metric[2:] # Metric name without the 'S_' prefix
    print ("\n\nWorking on %s" %metric)
    
    file_name = 'excels/' + directory_name + str(metric) + '.csv'
    my_data = np.genfromtxt(file_name, delimiter= ",", skip_header = 1, usecols = columns)
    
    X = 1 - my_data # Convert similarity measure into distance
    
    for ms in ms_values:    
        
        print ("\n----> MinPts %s" %ms)
        
        for epsilon in epsilons:   
    
            labels = DBSCAN(eps=epsilon, min_samples=ms, metric='precomputed').fit_predict(X)
        
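            if 0 not in labels:
                # DBSCAN numbers clusters from 0, so no label 0 means every point is noise
                n_clusters[epsilon, ms, metric] = 0
                n_noise = list(labels).count(-1)
                silh_coef[epsilon, ms, metric] = -1 # Placeholder: Silhouette is undefined without clusters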
         
         
            else:
                # Number of clusters in labels, ignoring noise if present.
                n_clusters[epsilon, ms, metric] =  len(set(labels)) - (1 if -1 in labels else 0)
                n_noise = list(labels).count(-1)
                print("\tNumber of clusters: ", n_clusters[epsilon, ms, metric])
                print("\tNumber of outliers: ", n_noise)
        
                try:
                    silh_coef[epsilon, ms, metric] = metrics.silhouette_score(X, labels, metric='precomputed')
                    print("\t\t",silh_coef[epsilon, ms, metric])
                except ValueError:
                    # silhouette_score needs at least two distinct labels; store -1 as a placeholder
                    silh_coef[epsilon, ms, metric] = -1



Working on S_gen_cos

----> MinPts 2
	Number of clusters:  437
	Number of outliers:  2164
		 0.18383059513673353
	Number of clusters:  463
	Number of outliers:  1898
		 0.2012604278659193
	Number of clusters:  388
	Number of outliers:  1516
		 0.1547509991121918
	Number of clusters:  256
	Number of outliers:  1142
		 0.07638056053471447
	Number of clusters:  180
	Number of outliers:  957
		 0.05148726401344943
	Number of clusters:  118
	Number of outliers:  787
		 0.03573924105587854
	Number of clusters:  72
	Number of outliers:  654
		 0.029838211246256255
	Number of clusters:  45
	Number of outliers:  562
		 0.024186198482233873
	Number of clusters:  28
	Number of outliers:  496
		 0.01598671643276528
	Number of clusters:  17
	Number of outliers:  423
		 0.011521357634261813
	Number of clusters:  11
	Number of outliers:  385
		 0.008763817911931315
	Number of clusters:  6
	Number of outliers:  355
		 0.005331107071604288

----> MinPts 3
	Number of clusters:  201
	Number of outliers

		 0.03379742287122019
	Number of clusters:  59
	Number of outliers:  1346
		 0.016814083756859124
	Number of clusters:  46
	Number of outliers:  1104
		 0.016463142967514364
	Number of clusters:  19
	Number of outliers:  908
		 0.00972486661778741
	Number of clusters:  11
	Number of outliers:  752
		 0.0064416196300505196
	Number of clusters:  8
	Number of outliers:  591
		 0.00540323361112773
	Number of clusters:  3
	Number of outliers:  470
		 0.002735169944638096

----> MinPts 5
	Number of clusters:  19
	Number of outliers:  3541
		 0.0036288447850918127
	Number of clusters:  32
	Number of outliers:  3427
		 0.015546867350947225
	Number of clusters:  61
	Number of outliers:  3186
		 0.03173841814450823
	Number of clusters:  88
	Number of outliers:  2739
		 0.04186393333604027
	Number of clusters:  83
	Number of outliers:  2442
		 0.03382638144208217
	Number of clusters:  59
	Number of outliers:  2052
		 0.02264537596648071
	Number of clusters:  31
	Number of outliers:  1679
		 0.01

	Number of clusters:  33
	Number of outliers:  1735
		 0.012163711979011576
	Number of clusters:  12
	Number of outliers:  1202
		 0.004890561766318865
	Number of clusters:  4
	Number of outliers:  798
		 0.0024411080538182797
	Number of clusters:  3
	Number of outliers:  527
		 0.0023348731030062237

----> MinPts 30
	Number of clusters:  1
	Number of outliers:  3626
		 0.00262132026004639
	Number of clusters:  1
	Number of outliers:  3580
		 0.004878800108288286
	Number of clusters:  1
	Number of outliers:  3566
		 0.005175357318626855
	Number of clusters:  1
	Number of outliers:  3562
		 0.005239530333684847
	Number of clusters:  2
	Number of outliers:  3511
		 0.005553087115878228
	Number of clusters:  3
	Number of outliers:  3393
		 0.006889323198050415
	Number of clusters:  1
	Number of outliers:  3046
		 0.0021621123080209437
	Number of clusters:  1
	Number of outliers:  1545
		 0.0012925162484424982
	Number of clusters:  1
	Number of outliers:  658
		 0.00103774260126057


Worki

		 0.07336589613319991


Working on S_path_jaccard

----> MinPts 2
	Number of clusters:  347
	Number of outliers:  2411
		 0.0837119587809754
	Number of clusters:  342
	Number of outliers:  2041
		 0.10388908746385218
	Number of clusters:  262
	Number of outliers:  1603
		 0.047191677200056896
	Number of clusters:  148
	Number of outliers:  1274
		 1.0296286869127216e-05
	Number of clusters:  98
	Number of outliers:  1106
		 -0.010833063122470952
	Number of clusters:  35
	Number of outliers:  950
		 -0.01569169311468265
	Number of clusters:  6
	Number of outliers:  876
		 0.01013164860183354
	Number of clusters:  3
	Number of outliers:  840
		 0.014199844743443152
	Number of clusters:  2
	Number of outliers:  825
		 0.01436251596515025
	Number of clusters:  1
	Number of outliers:  821
		 0.014843418564648384
	Number of clusters:  1
	Number of outliers:  818
		 0.014832260268751736
	Number of clusters:  1
	Number of outliers:  816
		 0.014822182208877637

----> MinPts 3
	Number of clust
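
In the output above, runs that report a single cluster still print a Silhouette value. This is because `silhouette_score` treats the noise label `-1` like any other label, so the outliers are scored as if they formed one more cluster. One possible alternative, sketched below and not used in this notebook, is to score only the points that DBSCAN assigned to a cluster. The helper `silhouette_without_noise` and the toy data are hypothetical.

```python
import numpy as np
from sklearn import metrics

def silhouette_without_noise(distance, labels):
    """Silhouette over clustered points only; None when it is undefined."""
    labels = np.asarray(labels)
    clustered = labels != -1                      # Mask of non-noise points
    n_labels = len(set(labels[clustered]))
    # silhouette_score requires 2 <= number of labels <= number of samples - 1
    if n_labels < 2 or n_labels >= clustered.sum():
        return None
    sub_distance = distance[clustered][:, clustered]
    return metrics.silhouette_score(sub_distance, labels[clustered], metric='precomputed')

# Toy data: two tight groups of three items plus one isolated item (index 6)
similarity = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.0, 0.1, 0.0],
    [0.9, 1.0, 0.9, 0.0, 0.1, 0.0, 0.0],
    [0.8, 0.9, 1.0, 0.1, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 1.0, 0.9, 0.8, 0.0],
    [0.0, 0.1, 0.0, 0.9, 1.0, 0.9, 0.0],
    [0.1, 0.0, 0.1, 0.8, 0.9, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])
distance = 1 - similarity
labels = np.array([0, 0, 0, 1, 1, 1, -1])         # Assumed DBSCAN-style labels

with_noise = metrics.silhouette_score(distance, labels, metric='precomputed')
without_noise = silhouette_without_noise(distance, labels)

print(with_noise, without_noise)
```

Whether noise should be included is a design choice: including it penalizes parameter settings that discard many points, while excluding it measures only the quality of the clusters that were actually formed.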

In [5]:
# WRITING THE RESULTS IN FILES

print('Writing results combinations in files... ')

for m in ms_values:
    
    clusters_path = "results/" + directory_name + "Num_clusters_minsample" + str(m) + ".csv"
    silhouette_path = "results/" + directory_name + "Silhouette_coefficient_minsample" + str(m) + ".csv"
    
    # The context manager closes both files even if a write fails
    with open(clusters_path, "w") as file_number_clust, open(silhouette_path, "w") as file_silhouette:
       
        for e in epsilons:
    
            # Each row starts with the Eps value, followed by one column per similarity metric
            file_number_clust.write(str(e))
            file_silhouette.write(str(e))
    
            for met in sim_metrics:
        
                file_number_clust.write("," + str(n_clusters[e, m, met]))
                file_silhouette.write(",%.4f" %silh_coef[e, m, met])

            file_number_clust.write("\n")
            file_silhouette.write("\n")


print('\tDONE!')

Writing results combinations in files... 
	DONE!
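
Once the dictionaries are filled, the best-scoring combination can be looked up directly. The sketch below uses a small hand-written `silh_coef` with made-up values (including a `-1` placeholder) to show one way of doing it; it is not part of the original analysis.

```python
# Hypothetical Silhouette values keyed by (Eps, MinPts, similarity metric)
silh_coef = {
    (0.3, 2, 'S_gen_cos'): 0.18,
    (0.4, 2, 'S_gen_cos'): 0.20,
    (0.3, 30, 'S_gen_cos'): -1,   # Placeholder used when no score could be computed
}

# Key with the highest Silhouette coefficient
best_key = max(silh_coef, key=silh_coef.get)
best_eps, best_ms, best_metric = best_key

print(best_key, silh_coef[best_key])
```

Since `-1` is also the lowest value the Silhouette coefficient can take, the placeholder entries never win this comparison.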
