# Data Science Mathematics
# K-Means Clustering
# In-Class Activity

Let's analyze our data set using the K-means implementation in scikit-learn (`sklearn.cluster.KMeans`).  First, import the relevant libraries.

In [2]:
from sklearn.cluster import KMeans
import numpy as np

Now let's enter our data set as a NumPy array, along with the known labels and the starting centroids.

In [48]:
data=np.array([[8,22,62],
[15,51,85],
[9,44,121],
[8,51,136],
[8,20,93],
[15,64,124],
[14,56,101],
[5,10,80],
[5,18,73],
[9,26,79]])

labels = [1,0,0,0,1,0,0,1,1,1]

centroids=[[10,20,80],[10,50,110]]

In [49]:
def calcdist(data,centroids):
    dist1 = []
    dist2 = []
    for pt in data: 
        pt1dist = np.sqrt((pt[0]-centroids[0][0])**2+(pt[1]-centroids[0][1])**2+(pt[2]-centroids[0][2])**2)
        pt2dist = np.sqrt((pt[0]-centroids[1][0])**2+(pt[1]-centroids[1][1])**2+(pt[2]-centroids[1][2])**2)
        dist1.append(pt1dist)
        dist2.append(pt2dist)
    return dist1, dist2 
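
The per-coordinate arithmetic in `calcdist` can also be written with NumPy broadcasting. What follows is a minimal sketch, not part of the original activity: `calcdist_vectorized` is a hypothetical name, and it returns a single array of shape (number of points, number of centroids) rather than two lists.

```python
import numpy as np

def calcdist_vectorized(data, centroids):
    """Euclidean distance from every point to every centroid.

    Returns an array of shape (n_points, n_centroids).
    """
    data = np.asarray(data, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # data[:, None, :] has shape (n, 1, d); centroids[None, :, :] has shape (1, k, d).
    # Broadcasting subtracts every centroid from every point.
    diff = data[:, None, :] - centroids[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Two points from the activity's data set and the starting centroids.
points = np.array([[8, 22, 62], [15, 51, 85]])
start = [[10, 20, 80], [10, 50, 110]]
dists = calcdist_vectorized(points, start)
print(dists.shape)  # (2, 2)
```

Column 0 of the result plays the role of `dist1` and column 1 the role of `dist2`, and the same function works unchanged for more than two centroids or more than three features.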

In [50]:
def cluster(data,dist1,dist2):
    cluster1 = []
    cluster2 = []
    for idx in range(len(data)):
        if dist1[idx] < dist2[idx]:
            cluster1.append(list(data[idx]))
        else:
            cluster2.append(list(data[idx]))
            
    return cluster1, cluster2

In [51]:
def calccentroids(cluster):
    x = 0
    y = 0
    z = 0
    for pt in cluster:
        x += pt[0]
        y += pt[1]
        z += pt[2]
    return [x/len(cluster),y/len(cluster),z/len(cluster)]

In [52]:
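# Run five K-means iterations: print the current centroids, assign each point
# to its nearest centroid, then recompute both centroids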
for i in range(5):
    print(centroids)
    dist1,dist2 = calcdist(data,centroids)
    cluster1,cluster2 = cluster(data,dist1,dist2)
    centroids[0] = calccentroids(cluster1)
    centroids[1] = calccentroids(cluster2)

[[10, 20, 80], [10, 50, 110]]
[[7.0, 19.2, 77.4], [12.2, 53.2, 113.4]]
[[7.0, 19.2, 77.4], [12.2, 53.2, 113.4]]
[[7.0, 19.2, 77.4], [12.2, 53.2, 113.4]]
[[7.0, 19.2, 77.4], [12.2, 53.2, 113.4]]
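
The printed centroids stop changing after the first update, so the remaining passes of the fixed five-iteration loop do no further work. Below is a sketch of the same assign-then-update procedure that stops as soon as the centroids stop moving. The name `run_kmeans` and the `max_iter` parameter are hypothetical, and the sketch assumes every cluster keeps at least one point. Ties go to the first centroid here, whereas the hand-written `cluster` function sends ties to the second; this data set has no ties.

```python
import numpy as np

data = np.array([[8, 22, 62], [15, 51, 85], [9, 44, 121], [8, 51, 136],
                 [8, 20, 93], [15, 64, 124], [14, 56, 101], [5, 10, 80],
                 [5, 18, 73], [9, 26, 79]], dtype=float)

def run_kmeans(data, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving.

    Assumes every cluster keeps at least one point.
    """
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([data[assign == k].mean(axis=0)
                                  for k in range(len(centroids))])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign

final_centroids, assign = run_kmeans(data, [[10, 20, 80], [10, 50, 110]])
print(final_centroids)
print(assign)
```

Here `assign` holds the index of the centroid each point ended up with, so index 0 corresponds to `cluster1` and index 1 to `cluster2`.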


In [53]:
cluster1

[[8, 22, 62], [8, 20, 93], [5, 10, 80], [5, 18, 73], [9, 26, 79]]

In [54]:
output = []

for pt in data:
    if list(pt) in cluster1:
        output.append(1)
    else:
        output.append(0)
    
print(output)
print(labels)

[1, 0, 0, 0, 1, 0, 0, 1, 1, 1]
[1, 0, 0, 0, 1, 0, 0, 1, 1, 1]


Now let's instantiate our k-means object, trained on our data set.

In [3]:
kmeans = KMeans(n_clusters=2, random_state=0).fit(data)

We can use the `labels_` attribute of the fitted model to get our data labels.  Each distinct integer represents a different cluster.

In [4]:
kmeans.labels_

array([1, 1, 0, 0, 1, 0, 0, 1, 1, 1])

Do the labels make sense based on our input data?  Go back to the in-class activity and see if the labels are the same.  Note that this algorithm may choose a different label convention (i.e., not 1=Military and 0=Non-Military, like in our example).  What we are interested in is whether the pattern in the label sequence matches.
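
Because the cluster numbers are arbitrary, the two label sequences should be compared under both naming conventions. The sketch below does this for the two-cluster case only. `agreement_up_to_swap` is a hypothetical helper, and the two lists are copied from the activity's labels and from the `kmeans.labels_` output printed above.

```python
import numpy as np

def agreement_up_to_swap(a, b):
    """Fraction of matching labels for two binary labelings, allowing 0 and 1 to be swapped."""
    a = np.asarray(a)
    b = np.asarray(b)
    direct = np.mean(a == b)        # same naming convention
    swapped = np.mean(a == 1 - b)   # opposite naming convention
    return float(max(direct, swapped))

hand_labels = [1, 0, 0, 0, 1, 0, 0, 1, 1, 1]     # labels from the in-class activity
sklearn_labels = [1, 1, 0, 0, 1, 0, 0, 1, 1, 1]  # kmeans.labels_ printed above
print(agreement_up_to_swap(hand_labels, sklearn_labels))  # 0.9
```

For the outputs shown here, the two sequences use the same convention and differ only at the second point, `[15, 51, 85]`.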

Now let's find our centroids.  Do they match the centroids you calculated with the code you wrote above?

In [5]:
kmeans.cluster_centers_

array([[ 11.5       ,  53.75      , 120.5       ],
       [  8.33333333,  24.5       ,  78.66666667]])
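
K-means can settle on different local optima depending on where the centroids start, so two runs need not agree. One way to compare two solutions is the within-cluster sum of squared distances, which scikit-learn exposes on a fitted model as `inertia_`. The sketch below computes that quantity directly with NumPy. `inertia` is a hypothetical helper, and both sets of centroids are copied from the outputs above.

```python
import numpy as np

data = np.array([[8, 22, 62], [15, 51, 85], [9, 44, 121], [8, 51, 136],
                 [8, 20, 93], [15, 64, 124], [14, 56, 101], [5, 10, 80],
                 [5, 18, 73], [9, 26, 79]], dtype=float)

def inertia(data, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    centroids = np.asarray(centroids, dtype=float)
    sq_dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(sq_dists.min(axis=1).sum())

hand_coded = [[7.0, 19.2, 77.4], [12.2, 53.2, 113.4]]                    # from the loop above
from_sklearn = [[11.5, 53.75, 120.5], [8.33333333, 24.5, 78.66666667]]   # cluster_centers_

hand_value = inertia(data, hand_coded)
sklearn_value = inertia(data, from_sklearn)
print(hand_value, sklearn_value)
```

On this data set the scikit-learn centroids give the smaller value (roughly 2490.9 against 2570.8), which is consistent with both results being stable K-means solutions reached from different starting points.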

***Now save your output.  Go to File -> Print Preview and save your final output as a PDF.  Turn it in to your instructor, along with any additional sheets.***

### How well did your algorithm cluster military personnel versus non-military personnel? Construct a confusion matrix and calculate the Matthews correlation coefficient (MCC). Write the code yourself rather than relying on NumPy, but feel free to check your answer with NumPy.

In [4]:
# The confusion matrix would look something like this:
# TP: 5 / FP: 0 / TN: 5 / FN: 0
# MCC = (TP*TN - FP*FN) / ((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))**(1/2)

Numerator = 5*5-0*0
Denominator = np.sqrt((5+0)*(5+0)*(5+0)*(5+0))
MCC = (Numerator/Denominator)
print(MCC)

#A result of 1.0 means the predicted clusters agree perfectly with the original labels (perfect positive correlation)

1.0
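
The cell above plugs the four counts in by hand. The sketch below derives them from the label lists instead, so the same code can score any binary clustering. The names `confusion_counts` and `mcc` are hypothetical, class 1 is treated as the positive class, and returning 0.0 when the denominator is zero is an assumed convention rather than part of the formula.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, treating 1 as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, fp, tn, fn

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return float((tp * tn - fp * fn) / denom) if denom else 0.0

labels = [1, 0, 0, 0, 1, 0, 0, 1, 1, 1]   # true labels from the activity
output = [1, 0, 0, 0, 1, 0, 0, 1, 1, 1]   # labels from the hand-coded clustering
print(confusion_counts(labels, output))   # (5, 0, 5, 0)
print(mcc(labels, output))                # 1.0

# Cross-check with NumPy: for binary labels the MCC equals the Pearson correlation.
print(np.corrcoef(labels, output)[0, 1])
```

The NumPy cross-check works because the MCC of two binary label vectors is the same quantity as their Pearson correlation coefficient.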


### You selected three features to use in this computation because you determined that they are the three features most correlated with "military" status. While adding features up to a certain point will improve a clustering model's accuracy, adding too many diminishes it. Explain why this is true.

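The "curse of dimensionality" says that as you add dimensions, the data become sparse and distances carry less and less information. K-means assigns each point to its nearest centroid using Euclidean distance, and every added feature contributes another squared term to that distance, so weakly related features add noise that can swamp the signal from the informative ones. A standard picture is an n-dimensional sphere inscribed in an n-dimensional cube: as the number of dimensions grows, the sphere takes up a vanishing share of the cube's volume, so almost all of the volume (and therefore almost all of the data) sits out in the corners, far from the center. Points then tend to lie at nearly the same distance from every centroid, the nearest and farthest centroids become hard to tell apart, and the centroids are of little to no value.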
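
The sphere-in-a-cube picture can be checked numerically. The sketch below uses the standard formula for the volume of an n-dimensional ball, V = π^(n/2) / Γ(n/2 + 1) · r^n, with r = 0.5 so that the ball is inscribed in the unit cube. The function names are hypothetical, and the dimensions chosen for printing are arbitrary.

```python
import math

def ball_to_cube_ratio(n):
    """Volume of the ball inscribed in the unit n-cube, divided by the cube's volume (1)."""
    radius = 0.5
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * radius ** n

def corner_to_face_ratio(n):
    """Distance from the cube's center to a corner, divided by the distance to a face.

    The corner is sqrt(n)/2 away and the face is 1/2 away, so the ratio is sqrt(n).
    """
    return math.sqrt(n)

for n in (2, 3, 10, 20):
    print(n, ball_to_cube_ratio(n), corner_to_face_ratio(n))
```

In two dimensions the inscribed circle covers π/4 of the square, about 79%. By ten dimensions the inscribed ball covers well under 1% of the cube, while the corners are more than three times as far from the center as the faces are.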