## CS530 Data Mining Homework 4 part 1

#### Question 1 (3 points): The Iris Dataset 

Load the Iris dataset using `datasets.load_iris()` from the scikit-learn library. You can find the documentation for this dataset on the scikit-learn website. Then write a function that takes two inputs:
1.	The data part of the Iris dataset, without the labels
2.	k, the number of clusters

The function should implement the k-means algorithm as learned in class. The output of the function should be a list of cluster labels, one for each record of the Iris dataset, with values from 1 to k.
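
The two alternating steps of k-means can be illustrated on a toy dataset. The following is a minimal pure-Python sketch, assuming the algorithm taught in class is the standard Lloyd iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). The function name `kmeans_toy`, the four sample points, and the fixed starting centroids are hypothetical choices for illustration and are not part of the assignment.

```python
import math


def kmeans_toy(points, centroids, max_iterations=100):
    """Run a Lloyd-style k-means from the given starting centroids.

    Returns (labels, centroids), with labels numbered from 1 to k.
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:  # no centroid moved, so the run has converged
            break
        centroids = new_centroids
    # Shift 0-based indices to the 1..k labels the question asks for.
    return [label + 1 for label in labels], centroids


# Hypothetical toy data: two well-separated groups of two points each.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels, centroids = kmeans_toy(points, centroids=[(1.0, 1.0), (9.0, 8.0)])
print(labels)     # [1, 1, 2, 2]
print(centroids)  # [(1.0, 1.5), (8.5, 8.0)]
```

The starting centroids are passed in explicitly so that the result is reproducible; with random initialization the cluster numbering can change from run to run.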

In [35]:
import numpy as np
from sklearn.datasets import load_iris


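def cluster_data(x, y):
    # Group the rows of x by label: returns one array per distinct label in y,
    # ordered by sorted label value
    y = np.asarray(y)
    x = np.asarray(x)
    y_uniques = np.unique(y)
    return [x[y == yi] for yi in y_uniques]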


def euclidean_distance(u, v):
    # Euclidean distance: square root of the sum of squared coordinate differences
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.sqrt(np.sum((u - v) ** 2)))


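def update_cluster_label(centroids, cluster_labels, x):
    # Append the index of the nearest centroid for every row of x.
    # cluster_labels is modified in place and must be empty on entry.
    for i in range(len(x)):
        distance = [np.linalg.norm(x[i] - centroid) for centroid in centroids]
        label = distance.index(min(distance))
        cluster_labels.append(label)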


def update_centroid(centroids, clusters):
    for i in range(len(clusters)):
        centroid = clusters[i].mean(axis=0)
        centroids[i] = centroid


def k_means(x, k=3, max_iterations=1000, tolerance=0.001):
    cluster_labels = []
    currentIterations = 0
    centroids = x[np.random.choice(x.shape[0], k, replace=False), :]

    while True:
        update_cluster_label(centroids, cluster_labels, x)
        clusters = cluster_data(x, cluster_labels)

        previousCentroids = np.array(centroids)
        update_centroid(centroids, clusters)
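        # Sum the absolute relative changes (in percent) so that positive and negative shifts cannot cancel out
        currentError = np.sum(np.abs((centroids - previousCentroids) / previousCentroids) * 100, dtype=np.float32)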
        if (currentIterations >= max_iterations) or (abs(currentError) <= tolerance):
            print(f'Iterations : {currentIterations}')
            print(f'Current Error : {currentError}')
            print(f'Current Centroids : {centroids}')
            break
        cluster_labels = []
        currentIterations += 1

    return cluster_labels

In [36]:
iris = load_iris()
X = iris.data
labels = k_means(X)
print(labels)


Iterations : 7
Current Error : 0.0
Current Centroids : [[6.85       3.07368421 5.74210526 2.07105263]
 [5.9016129  2.7483871  4.39354839 1.43387097]
 [5.006      3.428      1.462      0.246     ]]
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1]
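
The question asks for labels from 1 to k, while the list printed above is 0-based (values 0 to k-1). A minimal sketch of the conversion follows; the helper name `to_one_based` and the short sample list are hypothetical and stand in for the full output.

```python
def to_one_based(labels):
    """Shift 0-based cluster labels (0..k-1) to 1-based labels (1..k)."""
    return [int(label) + 1 for label in labels]


sample_labels = [2, 2, 1, 0, 1]  # hypothetical excerpt, not the full output
print(to_one_based(sample_labels))  # [3, 3, 2, 1, 2]
```

The shift only renames the clusters; the grouping of the records is unchanged.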
