# K-Means Clustering

K-means is an **unsupervised learning** algorithm that groups the data points of a dataset into "k" clusters based on how similar their features are. Each data point is assigned to the cluster whose **centroid** (the mean of the points in that cluster) is closest to it. K-means can help identify segments of data points that have similar features, even though they may not belong to the same target category. Unsupervised learning is less about predicting the correct categories and more about finding groups of data points that seem to be similar.
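
The assign-then-average idea described above can be sketched in plain Python. This is only a minimal illustration of the basic k-means iteration (often called Lloyd's algorithm), not the implementation sklearn uses. The function name `lloyd_kmeans` and the toy points are made up for this example.

```python
import math
import random


def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Cluster 2-D points with the basic k-means iteration (illustrative only)."""
    rng = random.Random(seed)
    # start from k distinct data points chosen at random
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(n_iter):
        # assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # update step: each centroid moves to the mean of its members
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # an empty cluster keeps its previous centroid
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # nothing moved, so the clusters are stable
        centroids = new_centroids
    return centroids, labels


# two made-up, well-separated groups of points
toy_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
              (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
toy_centroids, toy_labels = lloyd_kmeans(toy_points, k=2)
print(toy_centroids)
print(toy_labels)
```

Each pass repeats the same two steps until the centroids stop moving, which is why the final centroids are the means of the points in each cluster.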

<center>![Supervised vs Unsupervised learning](https://notebooks.azure.com/priesterkc/projects/testdb/raw/kmeans_cluster.png)</center>

Source: [Towards Data Science: Unsupervised Learning with Python](https://towardsdatascience.com/unsupervised-learning-with-python-173c51dc7f03)

In [None]:
from sklearn.cluster import KMeans  #algorithm in sklearn library to do k-means clustering

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
#load the dataset from a CSV file
location = "datasets/kmeansdata5.csv"
df = pd.read_csv(location)

#in this dataframe, feature1 & feature2 are the characteristics of each data point (predictive features)
#category is the target variable (categories to predict)
df

In [None]:
#plot data points to see visually where they are
#on a small and distinctive dataset, it is easier to see clusters
plt.scatter(df['feature1'],df['feature2'])

In [None]:
#make a dataframe called X to hold the predictive features
X = df.drop('category', axis=1)
X.head()

In [None]:
#dataframe of one column
#holds target variable "category"
y = df['category'].copy()
y.head()

In [None]:
#initialize k-means model
#set number of clusters to categorize = 4
kmeans = KMeans(n_clusters=4)

#the model will learn which data points seem similar
#calculates centroids
#classifies data points in clusters based on distance to centroids
kmeans.fit(X)

In [None]:
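#data point coordinates for centroids of each cluster
#cluster index order can change between runs (k-means starts from random centroids)
#in the original run:
#index 0 is blue cluster
#index 1 is yellow cluster
#index 2 is purple cluster
#index 3 is red cluster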
print(kmeans.cluster_centers_)
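
K-means starts from randomly chosen centroids, so the index given to each cluster can differ from one run to the next. One way to make the order repeatable is to fix the `random_state` parameter of `KMeans`. The sketch below uses made-up points rather than the CSV above, and the values `random_state=42` and `n_init=10` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up 2-D points in four loose groups (not the CSV used above)
rng = np.random.default_rng(1)
demo_X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(15, 2)),
    rng.normal(loc=(6, 0), scale=0.5, size=(15, 2)),
    rng.normal(loc=(0, 6), scale=0.5, size=(15, 2)),
    rng.normal(loc=(6, 6), scale=0.5, size=(15, 2)),
])

# the same random_state gives the same starting centroids on every run
first = KMeans(n_clusters=4, n_init=10, random_state=42).fit(demo_X)
second = KMeans(n_clusters=4, n_init=10, random_state=42).fit(demo_X)

# centroids come out in the same index order both times
same_order = np.allclose(first.cluster_centers_, second.cluster_centers_)
print(same_order)
```

Without a fixed `random_state`, the same clusters are usually found, but index 0 in one run may be index 2 in another.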

In [None]:
#scatterplot of original categories with k-means calculated centroids

#clusters of original features based on original category
plt.scatter(df['feature1'],df['feature2'], c=df['category'], cmap= 'rainbow')

#plot x, y axis coordinates for centroids
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], c='black')

In [None]:
#add a new column to dataframe called "cluster"
#can use this to compare features of clusters to original categories
df['cluster'] = kmeans.labels_

**Category number and cluster number will not always be the same!**

Cluster number is assigned based on the index number of the centroid a data point is closest to.
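
To make this concrete, the sketch below recomputes the cluster numbers by hand: it measures the distance from every point to every centroid and takes the row index of the nearest one. It uses made-up points instead of the dataframe above, and names such as `demo_X` and `demo_kmeans` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up 2-D points forming three loose groups (not the CSV used above)
rng = np.random.default_rng(0)
demo_X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(20, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(20, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(20, 2)),
])

demo_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(demo_X)

# distance from every point to every centroid: shape (60, 3)
diffs = demo_X[:, None, :] - demo_kmeans.cluster_centers_[None, :, :]
dists = np.linalg.norm(diffs, axis=2)

# the row index of the nearest centroid is the cluster number
nearest = dists.argmin(axis=1)
print((nearest == demo_kmeans.labels_).all())
```

Because the number is just a row position in `cluster_centers_`, it carries no meaning of its own and has no reason to match the original category number.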

In [None]:
#dataframe with new cluster column
df.head()

In [None]:
#scatterplot of k-means predicted clusters with calculated centroids

#k-means predicted clusters for the original data features
plt.scatter(df['feature1'],df['feature2'], c=df['cluster'], cmap= 'rainbow')

#plot x, y axis coordinates for centroids
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], c='black')

K-means classified the point (61, 11) as category 2 instead of 3 (its value in the original 'category' column).
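
Mismatches like this one can also be found without reading the plot. The sketch below cross-tabulates categories against clusters and lists the rows that fall outside their category's most common cluster. The eight labels are hypothetical and are not the values from the CSV.

```python
import pandas as pd

# hypothetical labels for eight points; not the values from the CSV above
demo = pd.DataFrame({
    "category": [1, 1, 2, 2, 3, 3, 3, 3],
    "cluster":  [0, 0, 3, 3, 1, 1, 1, 3],
})

# rows are original categories, columns are k-means clusters
table = pd.crosstab(demo["category"], demo["cluster"])
print(table)

# the most common cluster for each category
majority = table.idxmax(axis=1)

# rows whose cluster differs from their category's majority cluster
mismatched = demo[demo["cluster"] != demo["category"].map(majority)]
print(mismatched)
```

With the real dataframe, the same two columns ('category' and 'cluster') could be passed to `pd.crosstab` in the same way.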

In [None]:
#predict category of new data point
kpred = [50,50]

#model thinks new data point belongs to upper-left cluster
print(kmeans.predict([kpred]))
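
A prediction for a new point is simply the index of its nearest centroid. As a rough check, `KMeans.transform` returns the distance from a point to every centroid, so the smallest distance should line up with the predicted cluster. The training points and the new point below are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up training points in two groups (not the CSV used above)
demo_X = np.array(
    [[10, 90], [12, 88], [14, 92], [80, 20], [82, 18], [84, 22]],
    dtype=float,
)
demo_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(demo_X)

new_point = [[20, 80]]

# transform() returns the distance from each row to every centroid
distances = demo_kmeans.transform(new_point)[0]
predicted = demo_kmeans.predict(new_point)[0]

print(distances)
print(predicted, distances.argmin())
```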

In [None]:
#see plot of new data point

#k-means predicted clusters (in color) for the data points
plt.scatter(df['feature1'],df['feature2'], c=df['cluster'], cmap= 'rainbow')

#new data point is the black dot
plt.scatter(kpred[0],kpred[1],c='black')

## Iris Flowers

In the example below, we will use the Iris sample dataset from the Scikit-learn (sklearn) library.

In [None]:
from sklearn import datasets

In [None]:
#load in the "box" of items that belong to the iris data
iris_box = datasets.load_iris()

In [None]:
#items that are in the iris "box"
iris_box.keys()

#### Items that are contained in each key

**data**: Iris dataset (raw data, no column headers, no target variable)

**target**: contains a single array of all the target variable values (in order of the row numbers in "data")

**target_names**: contains the distinct (unique) category values from the target variable

**DESCR**: contains a description of the dataset

**feature_names**: contains a list of all the column header names for "data" (does not have target column header name)
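
A quick way to confirm what each key holds is to print its shape or contents. This is a small sketch that loads its own copy of the data under the hypothetical name `iris_demo`.

```python
from sklearn import datasets

iris_demo = datasets.load_iris()

print(iris_demo.data.shape)          # (rows, feature columns)
print(iris_demo.target.shape)        # one target value per row
print(list(iris_demo.target_names))  # distinct species names
print(iris_demo.feature_names)       # column headers for "data"
```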

In [None]:
print(iris_box.DESCR)

In [None]:
#make a dataframe from the "data" key
#add column headers from "feature_names" key
irisdf = pd.DataFrame(data=iris_box.data, columns=iris_box.feature_names)
irisdf.head()

In [None]:
#add target variable values to the dataframe
#values are in order of the row they belong to
irisdf['cat_num'] = iris_box.target
irisdf.head()

In [None]:
#species names from "target_names"
#species column numbers are in order of this index (0=setosa, 1=versicolor, 2=virginica)
iris_box.target_names

In [None]:
#change number in species column to species name
irisdf['species'] = irisdf['cat_num'].map({0:'setosa', 1:'versicolor', 2:'virginica'})
irisdf.head()
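
As an alternative to typing the dictionary by hand, the species names can be looked up directly from "target_names". The sketch below assumes only that "target_names" is a NumPy array, which lets it be indexed with the whole array of category numbers at once. The names `iris_demo` and `demo_df` are hypothetical.

```python
import pandas as pd
from sklearn import datasets

iris_demo = datasets.load_iris()

# indexing target_names with the array of category numbers
# looks up the species name for every row at once
species_names = iris_demo.target_names[iris_demo.target]

demo_df = pd.DataFrame({
    "cat_num": iris_demo.target,
    "species": species_names,
})
print(demo_df["species"].value_counts())
```

This keeps the number-to-name mapping tied to the dataset itself, so it cannot drift out of step with a manually written dictionary.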

In [None]:
#descriptive statistics of Iris dataset
irisdf.describe()

In [None]:
#average measurements for each iris flower species characteristics (sepal length & width, petal length & width)
irisdf.groupby('species').mean()

In [None]:
#plot iris species by sepal width vs length

#purple cluster is setosa
#green cluster is versicolor
#red cluster is virginica
plt.xlabel('sepal width (cm)')
plt.ylabel('sepal length (cm)')
plt.scatter(irisdf['sepal width (cm)'], irisdf['sepal length (cm)'], c=irisdf['cat_num'], cmap='rainbow')

In [None]:
#plot iris species by petal width vs length

#purple cluster is setosa
#green cluster is versicolor
#red cluster is virginica
plt.scatter(irisdf['petal width (cm)'], irisdf['petal length (cm)'], c=irisdf['cat_num'], cmap='rainbow')

plt.xlabel('petal width (cm)')
plt.ylabel('petal length (cm)')

#### Let's see if the k-means algorithm can figure out that the clusters relate to iris flower species

First, we'll fit it on the entire dataset. Later, we'll add a new data point (flower) and see which species (target) the model assigns it to.

In [None]:
#dataframe containing only predictive features
X = irisdf.drop(['species', 'cat_num'], axis=1)
X.head()

In [None]:
y = irisdf['cat_num'].copy()
y.head()

In [None]:
#initialize the k-means algorithm
#setting it to find 3 clusters (category groups)
kmeans = KMeans(n_clusters=3)

In [None]:
#teach the model where the data points are
#calculates centroids
#classifies data points to a cluster depending on closest centroid
kmeans.fit(X)

In [None]:
#data point coordinates for centroids of each cluster
#each row is a cluster (ideally one flower species)
#index order can change between runs; in the original run:

#index 0 is versicolor
#index 1 is setosa
#index 2 is virginica
print(kmeans.cluster_centers_)
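
Since the cluster index for each species can change between runs, it can help to work out the mapping from the data instead of hard-coding it. The sketch below counts how many flowers of each species land in each cluster and takes the most common species per cluster. It refits its own model under the hypothetical name `demo_kmeans`, and `random_state=0` is an arbitrary choice made only for repeatability.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

iris_demo = datasets.load_iris()
species = iris_demo.target_names[iris_demo.target]

# random_state is fixed here only so that the example is repeatable
demo_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris_demo.data)

# rows are clusters, columns are species, cells are flower counts
table = pd.crosstab(
    demo_kmeans.labels_, species,
    rownames=["cluster"], colnames=["species"],
)
print(table)

# the most common species inside each cluster
cluster_to_species = table.idxmax(axis=1).to_dict()
print(cluster_to_species)
```

The table also shows where the clusters and species disagree, which tends to be between versicolor and virginica.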

In [None]:
#scatterplot of sepal width & length with k-means calculated centroids

#clusters of original features based on original category
plt.scatter(irisdf['sepal width (cm)'], irisdf['sepal length (cm)'], c=irisdf['cat_num'], cmap='rainbow')

#plot x, y axis coordinates for centroids
plt.scatter(kmeans.cluster_centers_[:,1], kmeans.cluster_centers_[:,0], c='black')

plt.xlabel('sepal width (cm)')
plt.ylabel('sepal length (cm)')

In [None]:
#scatterplot of petal width & length with k-means calculated centroids

#clusters of original features based on original category
plt.scatter(irisdf['petal width (cm)'], irisdf['petal length (cm)'], c=irisdf['cat_num'], cmap='rainbow')

#plot x, y axis coordinates for centroids
plt.scatter(kmeans.cluster_centers_[:,3], kmeans.cluster_centers_[:,2], c='black')

plt.xlabel('petal width (cm)')
plt.ylabel('petal length (cm)')

In [None]:
irisdf['clusters'] = kmeans.labels_
irisdf.head()

In [None]:
#scatterplot of sepal width & length 
#predicted cluster group with k-means calculated centroids

#k-means predicted clusters for the original data features
plt.scatter(irisdf['sepal width (cm)'], irisdf['sepal length (cm)'], c=irisdf['clusters'], cmap='rainbow')

#plot x, y axis coordinates for centroids
plt.scatter(kmeans.cluster_centers_[:,1], kmeans.cluster_centers_[:,0], c='black')

plt.xlabel('sepal width (cm)')
plt.ylabel('sepal length (cm)')

In [None]:
#scatterplot of petal width & length 
#predicted cluster group with k-means calculated centroids

#k-means predicted clusters for the original data features
plt.scatter(irisdf['petal width (cm)'], irisdf['petal length (cm)'], c=irisdf['clusters'], cmap='rainbow')

#plot x, y axis coordinates for centroids
plt.scatter(kmeans.cluster_centers_[:,3], kmeans.cluster_centers_[:,2], c='black')

plt.xlabel('petal width (cm)')
plt.ylabel('petal length (cm)')

In [None]:
#new flower data point
#sepal length=7.2, sepal width=3.5, petal length=0.8, petal width=1.6

point = [7.2, 3.5, 0.8, 1.6]

#predict category of new data point
#in the original run, the model thinks the new data point belongs to the setosa cluster (1)
print(kmeans.predict([point]))

In [None]:
#see plot of new data point for sepal width & length

#k-means predicted clusters (in color) for the data points
plt.scatter(irisdf['sepal width (cm)'], irisdf['sepal length (cm)'], c=irisdf['clusters'], cmap= 'rainbow')

#new data point is the black dot
plt.scatter(point[1],point[0],c='black')

plt.xlabel('sepal width (cm)')
plt.ylabel('sepal length (cm)')

In [None]:
#see plot of new data point for petal width & length

#k-means predicted clusters (in color) for the data points
plt.scatter(irisdf['petal width (cm)'], irisdf['petal length (cm)'], c=irisdf['clusters'], cmap= 'rainbow')

#new data point is the black dot
plt.scatter(point[3],point[2],c='black')

plt.xlabel('petal width (cm)')
plt.ylabel('petal length (cm)')