# Assignment 3

## Question

Dataset:
1. The “Diabetes.arff” file contains the dataset.
2. Each row has 9 comma-separated values; the first 8 values represent a single data point (an 8-dimensional vector). Ignore the 9th value.

Questions:
There are two parameters in the DBSCAN algorithm:
a. Eps: radius of the neighbourhood around a point.
b. minPts: minimum number of points that must lie within Eps of a point for a cluster to form around it.
1. Implement the DBSCAN algorithm and find the number of clusters formed for eps = 2 and minPts = 5.
2. For any one cluster, show its core point and border points.
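
To make the two parameters concrete, here is a minimal sketch on a hypothetical 1-D toy dataset (not the diabetes data). The helper name `classify_points` and the toy values are illustrative assumptions. The sketch assumes the convention that a point counts toward its own neighbourhood, which is also how the solution below compares neighbour counts against minPts.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'Core', 'Border' or 'Noise' (toy helper, not part of the assignment)."""
    n = len(points)
    # Neighbourhood of i: every point within eps of i, including i itself.
    hoods = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
             for i in range(n)]
    # A core point has at least min_pts points in its neighbourhood.
    core = {i for i in range(n) if len(hoods[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append('Core')
        elif any(j in core for j in hoods[i]):
            # Not dense itself, but within eps of a core point.
            labels.append('Border')
        else:
            labels.append('Noise')
    return labels

toy = [(0.0,), (1.0,), (2.0,), (3.0,), (4.5,), (10.0,)]
print(classify_points(toy, eps=1.5, min_pts=3))
# ['Border', 'Core', 'Core', 'Core', 'Border', 'Noise']
```

The two end points of the dense run are border points because each lies within Eps of a core point without being dense itself; the isolated point at 10.0 is noise.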

## The Solution

### Importing the libraries

In [1]:
from scipy.io import arff
import pandas as pd
import math
from collections import OrderedDict
from sklearn.preprocessing import StandardScaler, MinMaxScaler

### Import data file in form of DataFrame

In [2]:
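# This notebook reads a CSV copy of the dataset; the file has no header row.
df = pd.read_csv('diabetes.csv', header=None)
# Drop the 9th column (index 8), which the question says to ignore.
del df[8]
df.head()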

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 35.294118 | 74.371859 | 59.016393 | 35.353535 | 0.0 | 50.074516 | 23.441503 | 48.333333 |
| 1 | 5.882353 | 42.713568 | 54.098361 | 29.292929 | 0.0 | 39.642325 | 11.656704 | 16.666667 |
| 2 | 47.058824 | 91.959799 | 52.459016 | 0.0 | 0.0 | 34.724292 | 25.362938 | 18.333333 |
| 3 | 5.882353 | 44.723618 | 54.098361 | 23.232323 | 11.111111 | 41.877794 | 3.800171 | 0.0 |
| 4 | 0.0 | 68.844221 | 32.786885 | 35.353535 | 19.858156 | 64.232489 | 94.363792 | 20.0 |


### Standardizing the features with StandardScaler

In [3]:
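# Rescale every column to zero mean and unit variance so that a single
# eps value is meaningful across all 8 features.
scaler = StandardScaler()
df = pd.DataFrame(scaler.fit_transform(df))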
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.639947 | 0.848324 | 0.149641 | 0.90727 | -0.692891 | 0.204013 | 0.468492 | 1.425995 |
| 1 | -0.844885 | -1.123396 | -0.160546 | 0.530902 | -0.692891 | -0.684422 | -0.365061 | -0.190672 |
| 2 | 1.23388 | 1.943724 | -0.263941 | -1.288212 | -0.692891 | -1.103255 | 0.604397 | -0.105584 |
| 3 | -0.844885 | -0.998208 | -0.160546 | 0.154533 | 0.123302 | -0.494043 | -0.920763 | -1.041549 |
| 4 | -1.141852 | 0.504055 | -1.504687 | 0.90727 | 0.765836 | 1.409746 | 5.484909 | -0.020496 |
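
As a rough illustration of what the scaling step does to one column, the sketch below computes z-scores by hand with the standard library. `StandardScaler` divides by the population standard deviation (ddof = 0), which `statistics.pstdev` matches. The helper name `standardize_column` and the sample values are assumptions made for this example only.

```python
import statistics

def standardize_column(values):
    """Return z-scores of `values` using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # A constant column has zero spread; this sketch maps it to all zeros.
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, population std 2
z = standardize_column(column)
print([round(v, 3) for v in z])
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After this transformation every feature is expressed in units of its own standard deviation, so a distance threshold such as eps = 2 treats all 8 dimensions on the same scale.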


### Function to calculate Euclidean distance between two points

In [4]:
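def dist(pointX, pointY):
    # Only the first 8 entries are features; rows in `database` carry two
    # extra trailing entries (point type and cluster label) that are skipped.
    disSquare = 0
    for i in range(min(len(pointX), 8)):
        disSquare += (pointX[i] - pointY[i]) ** 2

    return math.sqrt(disSquare)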
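
On Python 3.8 or later the same distance can be obtained from `math.dist`. The sketch below is an alternative, not the notebook's own function; the names `dist_builtin` and `N_FEATURES` are introduced here for illustration, and the rows are made-up values in the same layout as `database` (8 features, then point type, then cluster label).

```python
import math

N_FEATURES = 8  # number of leading feature columns in each row

def dist_builtin(pointX, pointY):
    """Euclidean distance over the first N_FEATURES entries of two rows."""
    # Slicing drops the trailing bookkeeping fields before measuring.
    return math.dist(pointX[:N_FEATURES], pointY[:N_FEATURES])

a = [0.0] * 8 + ['Undefined', -1]
b = [1.0] * 8 + ['Core', 0]
print(dist_builtin(a, b))  # sqrt(8), roughly 2.83
```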

### Function to get the neighbours (points in the epsilon neighbourhood) of a point

In [5]:
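def getNeighbours(database, point, eps, idx):
    # Return the indices of all points within eps of `point`.
    # The point itself (index idx) is excluded from the result.
    neighbours = []

    for i in range(len(database)):
        if i == idx:
            continue

        if (dist(point, database[i]) <= eps):
            neighbours.append(i)

    return neighbours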
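
`getNeighbours` scans the whole database on each call. One possible alternative, sketched below under the assumption that the points are plain feature tuples without the trailing bookkeeping fields, is to compute every neighbourhood once and use the symmetry of the distance so that each pair is measured a single time. The name `all_neighbours` and the toy data are hypothetical.

```python
import math

def all_neighbours(points, eps):
    """Return a list whose entry i holds the indices within eps of point i (i excluded)."""
    n = len(points)
    neighbours = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Distance is symmetric, so one measurement serves both points.
            if math.dist(points[i], points[j]) <= eps:
                neighbours[i].append(j)
                neighbours[j].append(i)
    return neighbours

toy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
print(all_neighbours(toy, eps=1.0))
# [[1, 2], [0], [0], []]
```

The trade-off is memory: all neighbour lists are held at once, which is acceptable for a few hundred points but may not be for much larger datasets.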

### Part 1: Implementing the DBSCAN Clustering Algorithm

#### Running the algorithm

In [6]:
eps = 2
minPts = 5
cur_cluster_label = -1
database = []

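# Copy each row into a list and append two bookkeeping fields:
# [-2] holds the point type and [-1] the cluster label (-1 = unassigned).
for index, row in df.iterrows():
    rowX = list(row)
    rowX.append('Undefined')
    rowX.append(-1)
    database.append(rowX)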


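for idx in range(len(database)):
    # Skip points that already belong to a cluster.
    if (database[idx][-1] != -1):
        continue

    # getNeighbours excludes the point itself, hence the + 1.
    neighbours = getNeighbours(database, database[idx], eps, idx)
    if (len(neighbours) + 1 < minPts):
        database[idx][-2] = 'Noise'
        continue

    # idx is a core point: start a new cluster and expand it.
    cur_cluster_label += 1
    database[idx][-2] = 'Core'
    database[idx][-1] = cur_cluster_label

    # neighbours grows while it is iterated, so it doubles as the seed queue.
    for x in neighbours:
        if database[x][-2] == 'Noise':
            # Noise that is reachable from a core point becomes a border point.
            database[x][-1] = cur_cluster_label
            database[x][-2] = 'Border'
            continue

        # Already claimed by this cluster or an earlier one.
        if database[x][-1] != -1:
            continue

        database[x][-1] = cur_cluster_label
        database[x][-2] = 'Border'
        neighboursY = getNeighbours(database, database[x], eps, x)

        # x is itself a core point, so its neighbours join the queue.
        if (len(neighboursY) + 1 >= minPts):
            database[x][-2] = 'Core'
            neighbours.extend(neighboursY)
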
                
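# Group point indices by cluster label; label -1 collects the points left as noise.
clusters = {}
for idx in range(len(database)):
    if database[idx][-1] not in clusters:
        clusters[database[idx][-1]] = []

    clusters[database[idx][-1]].append(idx)

# Sort by label so that the noise group (-1), if present, comes first.
clusters = OrderedDict(sorted(clusters.items(), key=lambda x: x[0]))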
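
Part 1 asks for the number of clusters, and the `clusters` dictionary above also holds key -1 for points that were never assigned to a cluster. A minimal sketch of how the count and a per-cluster tally of point types could be derived is shown below. The helper `summarise` and the toy rows are assumptions for illustration; the rows only mimic the layout used in `database` (features, then point type, then cluster label).

```python
from collections import Counter

def summarise(database):
    """Count clusters (label -1 excluded) and tally point types per label."""
    labels = {row[-1] for row in database}
    n_clusters = len(labels - {-1})
    # Keys are (cluster label, point type) pairs.
    type_counts = Counter((row[-1], row[-2]) for row in database)
    return n_clusters, type_counts

# Hypothetical rows: one feature, then point type, then cluster label.
toy_database = [
    [0.1, 'Core', 0],
    [0.2, 'Core', 0],
    [0.9, 'Border', 0],
    [5.0, 'Noise', -1],
]
n_clusters, type_counts = summarise(toy_database)
print(n_clusters)                      # 1
print(type_counts[(0, 'Core')])        # 2
print(type_counts[(0, 'Border')])      # 1
print(type_counts[(-1, 'Noise')])      # 1
```

Calling the same helper on the notebook's `database` after the cell above has run would give the cluster count for eps = 2 and minPts = 5 without counting the noise group.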

#### Printing the clusters formed along with outliers present if any

In [7]:
for i in clusters.keys():
    if i == -1:
        print("Outliers" + " -> " + str(clusters[i]))
        print("Clusters:")
    else:    
        print(str(i) + " -> " + str(clusters[i]))
        print(" ")

Outliers -> [4, 8, 9, 13, 43, 45, 58, 75, 145, 177, 182, 193, 220, 228, 231, 247, 254, 342, 349, 357, 362, 370, 371, 445, 453, 459, 502, 519, 537, 549, 579, 622, 661, 684, 702]
Clusters:
0 -> [0, 1, 2, 3, 5, 6, 10, 11, 12, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 44, 46, 47, 48, 50, 51, 52, 53, 54, 55, 56, 57, 59, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 76, 77, 79, 80, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 178, 179, 180, 181, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 194, 195, 196, 197,

### Part 2: For any one cluster, show its core point and border points

#### Printing the Core and Border points for Cluster 0

In [8]:
print("Index\tPoint Type")
for i in clusters[0]:
    print(str(i) + "\t" + database[i][-2])

Index	Point Type
0	Core
1	Core
2	Core
3	Core
5	Core
6	Core
10	Core
11	Core
12	Border
14	Core
16	Core
17	Core
18	Core
19	Core
20	Core
21	Core
22	Core
23	Core
24	Core
25	Core
26	Core
27	Core
28	Border
29	Core
30	Core
31	Core
32	Core
33	Core
34	Core
35	Core
36	Core
37	Core
38	Core
39	Border
40	Core
41	Core
42	Core
44	Core
46	Core
47	Core
48	Core
50	Core
51	Core
52	Core
53	Core
54	Core
55	Core
56	Core
57	Core
59	Core
61	Core
62	Core
63	Core
64	Core
65	Core
66	Core
67	Border
68	Core
69	Core
70	Core
71	Core
72	Core
73	Core
74	Core
76	Core
77	Core
79	Core
80	Core
82	Core
83	Core
84	Core
85	Core
86	Core
87	Core
88	Core
89	Core
90	Core
91	Core
92	Core
93	Core
94	Core
95	Core
96	Core
97	Core
98	Core
99	Core
100	Border
101	Core
102	Core
103	Core
104	Core
105	Core
106	Border
107	Core
108	Core
109	Core
110	Core
111	Core
112	Core
113	Core
114	Core
115	Core
116	Core
117	Core
118	Core
119	Core
120	Border
121	Core
122	Core
123	Core
124	Core
125	Border
126	Core
127	Core
128	Core
129	Core
130	Core
131	Co