# Unsupervised Learning: Clustering Lab





In [1]:
from sklearn.base import BaseEstimator, ClassifierMixin, ClusterMixin
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
import numpy as np
import matplotlib.pyplot as plt

from scipy.io import arff
import pandas as pd
from IPython.core.display import display
import pprint
pp = pprint.PrettyPrinter(indent=4)
from scipy.spatial.distance import cdist
import gc

In [3]:
# HELPER FUNCTIONS

def load_data(filename):
    data = arff.loadarff(filename)
    df = pd.DataFrame(data[0])

    for i in range(len(df.dtypes)):
        if df.dtypes.astype(str).iloc[i] == 'object':
            column = df.columns[i]
            df[column] = df[column] \
                            .astype(str).str \
                            .split("\'", expand=True) \
                            .iloc[:,1]
    return df

def normalize(data):
    return data / data.max(axis=0)

## 1. (50%) Implement the k-means clustering algorithm and the HAC (Hierarchical Agglomerative Clustering) algorithm.

### 1.1.1 HAC

### Code requirements 
- HAC should support both single link and complete link options.
- HAC automatically generates all clusterings from n to 1.  To simplify the amount of output you may want to implement a mechanism to specify for which k values actual output will be generated.


---
The output should include the following:
- The number of clusters (k).
- The silhouette score of the full clustering. (You can either write and use your own silhouette_score function (extra credit) or use sklearn's)


For each cluster report include:


- The centroid id.
- The number of instances tied to that centroid. 
---

In [5]:
class HACClustering(BaseEstimator,ClusterMixin):

    def __init__(self,k=3,link_type='single'): ## add parameters here
        """
        Args:
            k = how many final clusters to have
            link_type = single or complete. when combining two clusters use complete link or single link
        """
        self.link_type = link_type
        self.k = k
        self.clusters = []
        
    def fit(self, X, y=None):
        """ Fit the data; In this lab this will make the K clusters :D
        Args:
            X (array-like): A 2D numpy array with the training data
            y (array-like): An optional argument. Clustering is usually unsupervised so you don't need labels
        Returns:
            self: this allows this to be chained, e.g. model.fit(X,y).predict(X_test)
        """

        # Put all instances in their own cluster
        if y is not None:
            X = np.append(X, y.reshape(-1,1), axis=1)
        self.init_clusters(X)

        # Combine clusters one at a time until desired cluster count is reached
        while len(self.clusters) > self.k:
            gc.collect()
            self.combine_two_clusters()
            print(len(self.clusters))

        return self

    def combine_two_clusters(self):
        
        min_cluster_dist = np.inf
        min_dist_clusters_inds = (None, None)

        for i in range(len(self.clusters)):
            c1 = self.clusters[i]
            
            for j in range(len(self.clusters)):
                if i == j:
                    continue

                c2 = self.clusters[j]
                distance_metric = self.cluster_distance(c1, c2)

                if distance_metric < min_cluster_dist:
                    min_cluster_dist = distance_metric
                    min_dist_clusters_inds = (i, j)

        new_clusters = []
        c1,  c2 = self.clusters[min_dist_clusters_inds[0]], self.clusters[min_dist_clusters_inds[1]]
        combined_cluster = np.append(c1, c2, axis=0)
        new_clusters.append(combined_cluster)

        for ind in range(len(self.clusters)):
            if ind != min_dist_clusters_inds[0] and ind != min_dist_clusters_inds[1]:
                new_clusters.append(self.clusters[ind])

        self.clusters = new_clusters

    def cluster_distance(self, c1, c2):
        dist = cdist( c1, c2, metric='euclidean' )  # -> (nx, ny) distances
        if self.link_type == 'single':
            return np.min(dist)
        elif self.link_type == 'complete':
            return np.max(dist)

    def init_clusters(self, instances):
        for i in range(instances.shape[0]):
            cluster = instances[i, :].reshape(1, -1)
            self.clusters.append(cluster)
    
    def print_clusters(self):
        """
            Used for grading.
            print("Num clusters: {:d}\n".format(k))
            print("Silhouette score: {:.4f}\n\n".format(silhouette_score))
            for each cluster and centroid:
                print(np.array2string(centroid,precision=4,separator=","))
                print("{:d}\n".format(size of cluster))
        """
        print(self.clusters)

### 1.1.2 Debug 

Debug your model by running it on the [Debug Dataset](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/abalone.arff)


---
The dataset was modified to be a lot smaller. The last datapoint should be on line 359 or the point 0.585,0.46,0.185,0.922,0.3635,0.213,0.285,10. The remaining points should be commented out.


- Make sure to include the output class (last column) as an additional input feature
- Normalize Data
- K = 5
- Use 4 decimal places and DO NOT ROUND when reporting silhouette score and centroid values.


---
Solutions in files:

[Debug HAC Single (Silhouette).txt](https://raw.githubusercontent.com/cs472ta/CS472/master/debug_solutions/Debug%20HAC%20Single%20Link%20%28Silhouette%29.txt)

[Debug HAC Complete (Silhouette).txt](https://raw.githubusercontent.com/cs472ta/CS472/master/debug_solutions/Debug%20HAC%20Complete%20Link%20%28Silhouette%29.txt)

In [6]:
# Debug Here
dbDF = load_data("datasets/abalone.arff")
display(dbDF.head(2))
display(dbDF.tail(2))

X_db = dbDF.iloc[:, :-1].to_numpy()
y_db = dbDF.iloc[:, -1].to_numpy()

print(X_db.shape)
print(y_db.shape)

X_db = normalize(X_db)
y_db = normalize(y_db.reshape(1, -1)).reshape(-1)

hac = HACClustering(k=5)
hac.fit(X_db, y_db)

Unnamed: 0,Length,Diameter,Height,Whole_weight,Shucked_weight,Viscera_weight,Shell_weight,Rings
0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15.0
1,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7.0


Unnamed: 0,Length,Diameter,Height,Whole_weight,Shucked_weight,Viscera_weight,Shell_weight,Rings
198,0.56,0.45,0.16,0.922,0.432,0.178,0.26,15.0
199,0.585,0.46,0.185,0.922,0.3635,0.213,0.285,10.0


(200, 7)
(200,)
199
198
197
196
195
194
193
192
191
190
189
188
187
186
185
184
183
182
181
180
179
178
177
176
175
174
173
172
171
170
169
168
167
166
165
164
163
162
161
160
159
158
157
156
155
154
153
152
151
150
149
148
147
146
145
144
143
142
141
140
139
138
137
136
135
134
133
132
131
130
129
128
127
126
125
124
123
122
121
120
119
118
117
116
115
114
113
112
111
110
109
108
107
106
105
104
103
102
101
100
99
98
97
96
95
94
93
92
91
90
89
88
87
86
85
84
83
82
81
80
79
78
77
76
75
74
73
72
71
70
69
68
67
66
65
64
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5


HACClustering(k=5)

In [7]:
pp.pprint(hac.clusters)

[   array([[1.        , 0.97391304, 0.91304348, 0.83960784, 0.6071929 ,
        0.73567468, 1.        , 1.        ],
       [1.        , 1.        , 0.76086957, 0.83294118, 0.71461934,
        0.83456562, 0.84577114, 1.        ]]),
    array([[0.97931034, 0.93913043, 0.7173913 , 0.76823529, 0.71602055,
        0.48243993, 0.7761194 , 1.        ],
       [0.97241379, 0.97391304, 0.95652174, 0.77686275, 0.76366184,
        0.5702403 , 0.75621891, 1.        ]]),
    array([[0.89655172, 0.90434783, 0.82608696, ..., 0.56561922, 0.44427861,
        1.        ],
       [0.87586207, 0.89565217, 0.82608696, ..., 0.56377079, 0.44776119,
        1.        ],
       [0.86896552, 0.86956522, 0.80434783, ..., 0.61275416, 0.37810945,
        1.        ],
       ...,
       [0.88965517, 0.84347826, 0.93478261, ..., 0.48336414, 0.6318408 ,
        1.        ],
       [0.93793103, 0.95652174, 0.76086957, ..., 0.72550832, 0.45273632,
        1.        ],
       [0.89655172, 0.94782609, 1.        , ..., 0

### 1.1.3 Evaluation

We will evaluate your model based on its print_clusters() output using [Evaluation Dataset](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/seismic-bumps_train.arff)

- Make sure to include the output class (last column) as an additional input feature
- Normalize Data
- K = 5
- Use 4 decimal places and DO NOT ROUND when reporting silhouette score and centroid values.

#### 1.1.3.1 Complete Link

In [None]:
# Load evaluation data

# Train on evaluation data using complete link

# Print clusters

#### 1.1.3.1 Single Link

In [None]:
# Load evaluation data

# Train on evaluation data using single link

# Print clusters

### 1.2.1 K-Means

### Code requirements 
- Ability to choose k and specify k initial centroids
- Use Euclidean Distance as metric
- Ability to handle distance ties
- Include output label as a cluster feature


---
The output should include the following:
- The number of clusters (k).
- The silhouette score of the full clustering. (You can either write and use your own silhouette_score function (extra credit) or use sklearn's)


For each cluster report include:


- The centroid id.
- The number of instances tied to that centroid. 
---
You only need to handle continuous features

In [1]:
class KMEANSClustering(BaseEstimator,ClusterMixin):

    def __init__(self,k=3,debug=False): ## add parameters here
        """
        Args:
            k = how many final clusters to have
            debug = if debug is true use the first k instances as the initial centroids otherwise choose random points as the initial centroids.
        """
        self.k = k
        self.debug = debug

    def fit(self, X, y=None):
        """ Fit the data; In this lab this will make the K clusters :D
        Args:
            X (array-like): A 2D numpy array with the training data
            y (array-like): An optional argument. Clustering is usually unsupervised so you don't need labels
        Returns:
            self: this allows this to be chained, e.g. model.fit(X,y).predict(X_test)
        """
        return self
    
    def print_clusters(self):
        """
            Used for grading.
            print("Num clusters: {:d}\n".format(k))
            print("Silhouette score: {:.4f}\n\n".format(silhouette_score))
            for each cluster and centroid:
                print(np.array2string(centroid,precision=4,separator=","))
                print("{:d}\n".format(size of cluster))
        """
        pass

### 1.2.2 Debug 

Debug your model by running it on the [Debug Dataset](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/abalone.arff)


- Train until convergence
- Make sure to include the output class (last column) as an additional input feature
- Normalize Data
- K = 5
- Use the first k instances as the initial centroids
- Use 4 decimal places and DO NOT ROUND when reporting silhouette score and centroid values




---
Solutions in files:

[Debug K Means (Silhouette).txt](https://raw.githubusercontent.com/cs472ta/CS472/master/debug_solutions/Debug%20K%20Means%20%28Silhouette%29.txt)

In [None]:
# Load debug data

# Train on debug data

# Print clusters

### 1.2.3 Evaluation

We will evaluate your model based on its print_clusters() output using [Evaluation Dataset](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/seismic-bumps_train.arff)
- Train until convergence
- Make sure to include the output class (last column) as an additional input feature
- Normalize Data
- K = 5
- Use the first k instances as the initial centroids
- Use 4 decimal places and DO NOT ROUND when reporting silhouette score and centroid values

In [None]:
# Load evaluation data

# Train on evaluation data

# Print clusters

## 2.1.1 (7.5%) Clustering the Iris Classification problem - HAC

Load the Iris Dataset [Iris Dataset](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/iris.arff)

- Use single-link and complete link clustering algorithms
- State whether you normalize your data or not (your choice).  
- Show your results for clusterings using k = 2-7.  
- Graph the silhouette score for each k and discuss your results (i.e. what kind of clusters are being made).
---

In [None]:
# Iris Classification using single-link

In [None]:
# Iris Classification using complete-link

Discuss differences between single-link and complete-link

## 2.1.2 (5%) Clustering the Iris Classification problem - HAC

Requirements:
- Repeat excercise 2.1.1 and include the output label as one of the input features.

In [None]:
# Clustering Labels using single-link

In [None]:
# Clustering Labels using complete-link

Discuss any differences between the results from 2.1.1 and 2.1.2.

## 2.2.1 (7.5%) Clustering the Iris Classification problem: K-Means

Load the Iris Dataset [Iris Dataset](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/iris.arff)

Run K-Means on the Iris dataset using the output label as a feature and without using the output label as a feature

Requirements:
- State whether you normalize your data or not (your choice).  
- Show your results for clusterings using k = 2-7.  
- Graph the silhouette score for each k and discuss your results (i.e. what kind of clusters are being made).
---

In [5]:
# Iris Classification without output label

In [6]:
# Iris Classification with output label

Compare results and differences between using the output label and excluding the output label

## 2.2.2 (5%) Clustering the Iris Classification problem: K-Means

Requirements:
- Use the output label as an input feature
- Run K-Means 5 times with k=4, each time with different initial random centroids and discuss any variations in the results. 

In [None]:
#K-Means 5 times

Discuss any variations in the results

## 3.1 (12.5%) Run the SK versions of HAC (both single and complete link) on iris including the output label and compare your results with those above.
Use the silhouette score for this iris problem(k = 2-7).  You may write your own code to do silhouette (optional extra credit) or you can use sklearn.metrics.silhouette_score. Please state if you coded your own silhouette score function to receive the extra credit points (described below). Discuss how helpful Silhouette appeared to be for selecting which clustering is best. You do not need to supply full Silhouette graphs, but you could if you wanted to.

Requirements
- Use the Sillhouette score for this iris problem (k= 2-7) 
- Use at least one other scoring function from [sklearn.metrics](https://scikit-learn.org/stable/modules/model_evaluation.html) and compare the results. State which metric was used. 
- Possible sklean metrics include (* metrics require ground truth labels):
    - adjusted_mutual_info_score*
    - adjusted_rand_score*
    - homogeneity_score*
    - completeness_score*
    - fowlkes_mallows_score*
    - calinski_harabasz_score
    - davies_bouldin_score
- Experiment using different hyper-parameters. Discuss Results

In [None]:
# Load sklearn



*Record impressions*

## 3.2 (12.5%) Run the SK version of k-means on iris including the output label and compare your results with those above. 

Use the silhouette score for this iris problem(k = 2-7). You may write your own code to do silhouette (optional extra credit) or you can use sklearn.metrics.silhouette_score. Please state if you coded your own silhouette score function to receive the extra credit points (described below). Discuss how helpful Silhouette appeared to be for selecting which clustering is best. You do not need to supply full Silhouette graphs, but you could if you wanted to.

Requirements
- Use the Sillhouette score for this iris problem (k= 2-7) 
- Use at least one other scoring function form sklearn.metrics and compare the results. State which metric was used
- Experiment different hyper-parameters. Discuss Results

In [None]:
# Load sklearn 



*Record impressions*

## 4. (Optional 5% extra credit) For your silhouette experiment above, write and use your own code to calculate the silhouette scores, rather than the SK or other version. 


*Show findings here*

In [None]:
# Copy function Below