# K-means clustering
Alexa Serrano Negrete

This notebook studies and implements k-means clustering using `sklearn`. The cartwheel dataset is used to identify clusters automatically.


## Acknowledgments

- Data from the Coursera course "Understanding and Visualizing Data with Python" by the University of Michigan (https://www.coursera.org/)


# Importing libraries

In [13]:
import pandas as pd                 # For data handling
import seaborn as sns               # For advanced plotting
import matplotlib.pyplot as plt     # For showing plots
from sklearn.cluster import KMeans  # For k-means clustering

# Importing data

In [14]:
route = "datasets/cartwheel/cartwheel.csv"
df = pd.read_csv(route)

# Understanding and preprocessing the data

1. Get a general 'feel' of the data


In [15]:
print(df.shape)
print(df.head())   # head is a method: call it to show the first five rows

(28, 12)
   ID   Age Gender  GenderGroup Glasses  GlassesGroup  Height  Wingspan  \
0   1  56.0      F            1       Y             1    62.0      61.0   
1   2  26.0      F            1       Y             1    62.0      60.0   
2   3  33.0      F            1       Y             1    66.0      64.0   
3   4  39.0      F            1       N             0    64.0      63.0   
4   5  27.0      M            2       N             0    73.0      75.0   

2. Drop rows with any missing values

In [16]:
df = df.dropna()
print(df.shape)

(25, 12)
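As a small illustration of this step, the sketch below builds a hypothetical three-row frame (the values are made up and are not taken from the cartwheel file) and shows that `dropna()` removes every row that contains at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the cartwheel file
toy = pd.DataFrame({
    "Age": [56.0, np.nan, 33.0],
    "Height": [62.0, 62.0, np.nan],
})

# Count the missing values per column before dropping anything
missing_per_column = toy.isna().sum()

# Keep only the rows with no missing values
clean = toy.dropna()

print(missing_per_column)
print(clean.shape)
```

Checking `isna().sum()` first is a cheap way to see which columns are responsible for the rows that get dropped.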


3. Encode the categorical class label column: from string to number


In [17]:
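# Map the class label strings to integers. The mapping targets a "Flower"
# column; the frame displayed below has no such column, so it is left unchanged.
df = df.replace({"Flower": {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}})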
df

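| | ID | Age | Gender | GenderGroup | Glasses | GlassesGroup | Height | Wingspan | CWDistance | Complete | CompleteGroup | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56.0 | F | 1 | Y | 1 | 62.0 | 61.0 | 79 | Y | 1.0 | 7 |
| 1 | 2 | 26.0 | F | 1 | Y | 1 | 62.0 | 60.0 | 70 | Y | 1.0 | 8 |
| 2 | 3 | 33.0 | F | 1 | Y | 1 | 66.0 | 64.0 | 85 | Y | 1.0 | 7 |
| 3 | 4 | 39.0 | F | 1 | N | 0 | 64.0 | 63.0 | 87 | Y | 1.0 | 10 |
| 4 | 5 | 27.0 | M | 2 | N | 0 | 73.0 | 75.0 | 72 | N | 0.0 | 4 |
| 5 | 6 | 24.0 | M | 2 | N | 0 | 75.0 | 71.0 | 81 | N | 0.0 | 3 |
| 6 | 7 | 28.0 | M | 2 | N | 0 | 75.0 | 76.0 | 107 | Y | 1.0 | 10 |
| 7 | 8 | 22.0 | F | 1 | N | 0 | 65.0 | 62.0 | 98 | Y | 1.0 | 9 |
| 8 | 9 | 29.0 | M | 2 | Y | 1 | 74.0 | 73.0 | 106 | N | 0.0 | 5 |
| 9 | 10 | 33.0 | F | 1 | Y | 1 | 63.0 | 60.0 | 65 | Y | 1.0 | 8 |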
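One way to encode a string column as integers is `Series.map` with an explicit dictionary. The sketch below uses a hypothetical toy frame with a `Gender` column and mirrors the coding visible in the output above (F as 1, M as 2); the frame and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame, not the cartwheel file
toy = pd.DataFrame({"Gender": ["F", "M", "F"]})

# Explicit mapping from category string to integer code
codes = {"F": 1, "M": 2}
toy["GenderGroup"] = toy["Gender"].map(codes)

print(toy)
```

Unlike `replace`, `map` turns any value missing from the dictionary into `NaN`, which makes unexpected categories easy to spot.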


4. Discard columns that won't be used


In [19]:
df.drop(['Sepal Width', 'Sepal Length'], axis = 'columns', inplace=True)
df

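KeyError: "['Sepal Width' 'Sepal Length'] not found in axis"

The error is raised because `Sepal Width` and `Sepal Length` are not among the columns of the frame shown above.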

5. Scatter plot of the data

In [None]:
plt.scatter(df.PetalLength, df.PetalWidth)
plt.title('Petal Width vs Petal Length')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.show()

6. Scatter plot of the data, assigning each point to the class it belongs to

In [None]:
df1 = df[df.Flower == 0]
df2 = df[df.Flower == 1]
df3 = df[df.Flower == 2]

plt.scatter(df1.PetalLength, df1.PetalWidth, label = 'Flower Group 1')
plt.scatter(df2.PetalLength, df2.PetalWidth, label = 'Flower Group 2')
plt.scatter(df3.PetalLength, df3.PetalWidth, label = 'Flower Group 3')

plt.title('Petal Width vs Petal Length')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.legend()
plt.show()

So, for this dataset we know in advance the class to which each point belongs.

# Kmeans clustering

Fit k-means with K = 3 on the two selected features and predict the cluster of each point

In [None]:
km = KMeans(n_clusters = 3)
yp = km.fit_predict(df[['PetalLength','PetalWidth']])
yp

In [None]:
df['Cluster1'] = yp
df

In [None]:
df.Cluster1.unique()

In [None]:
km.cluster_centers_

In [None]:
km.inertia_
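The `inertia_` attribute is the sum of squared distances from each sample to the centroid of its assigned cluster. The sketch below checks this by hand on synthetic data generated with `make_blobs`; the data and the variable names are assumptions for illustration, not the notebook's dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three groups (illustrative only)
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Centroid of the cluster each point was assigned to, one row per point
assigned_centroids = km.cluster_centers_[labels]

# Sum of squared distances between points and their centroids
manual_sse = float(np.sum((X - assigned_centroids) ** 2))

print(manual_sse, km.inertia_)
```

The two printed numbers should agree up to floating-point error.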

<span style='color:Blue'> **Important remarks**  </span>

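- The integer assigned to each cluster (0, 1 or 2) is arbitrary: it depends on the random initialization of the centroids and can change from run to run
- Consequently, the cluster numbers do not necessarily match the numbers used to encode the real classes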
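A minimal sketch of what these remarks mean in practice, assuming synthetic, well-separated data from `make_blobs`: two fits with different seeds may number the clusters differently, but once the clusters are renumbered by a fixed rule the assignments coincide. The helper `canonical_labels` is hypothetical and only serves this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=90, centers=[[-10, 0], [0, 0], [10, 0]],
                  cluster_std=0.5, random_state=0)

def canonical_labels(km, labels):
    """Renumber clusters by the first coordinate of their centroid."""
    order = np.argsort(km.cluster_centers_[:, 0])   # cluster ids, left to right
    new_id = np.empty_like(order)
    new_id[order] = np.arange(len(order))           # old id -> new id
    return new_id[labels]

# Same data, two different random seeds
km_a = KMeans(n_clusters=3, n_init=10, random_state=1)
km_b = KMeans(n_clusters=3, n_init=10, random_state=2)
labels_a = km_a.fit_predict(X)
labels_b = km_b.fit_predict(X)

canon_a = canonical_labels(km_a, labels_a)
canon_b = canonical_labels(km_b, labels_b)

print(labels_a[:10])
print(labels_b[:10])
print(np.array_equal(canon_a, canon_b))
```

For data this well separated, both runs are expected to find the same partition, so only the numbering can differ.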

# Plot estimated clusters

Plot estimated clusters

In [None]:
df1 = df[df.Cluster1 == 0]
df2 = df[df.Cluster1 == 1]
df3 = df[df.Cluster1 == 2]


plt.scatter(df1.PetalLength, df1.PetalWidth, label = 'Estimated Flower Group 1')
plt.scatter(df2.PetalLength, df2.PetalWidth, label = 'Estimated Flower Group 2')
plt.scatter(df3.PetalLength, df3.PetalWidth, label = 'Estimated Flower Group 3')

plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], color = 'yellow', marker='*', label='Centroids', s=256)

plt.title('Petal Width vs Petal Length')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.legend()
plt.show()

Plot real clusters and visual comparison

In [None]:
df1 = df[df.Flower==0]
df2 = df[df.Flower==1]
df3 = df[df.Flower==2]

plt.scatter(df1.PetalLength, df1.PetalWidth, label = 'Flower Group 1')
plt.scatter(df2.PetalLength, df2.PetalWidth, label = 'Flower Group 2')
plt.scatter(df3.PetalLength, df3.PetalWidth, label = 'Flower Group 3')

plt.title('Petal Width vs Petal Length')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.legend()
plt.show()
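The visual comparison can be complemented with numbers. The sketch below, on synthetic data from `make_blobs` (an assumption, not the notebook's dataset), cross-tabulates the known classes against the estimated clusters and computes the adjusted Rand index, which does not depend on how the clusters happen to be numbered.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with known classes (illustrative only)
X, y_true = make_blobs(n_samples=150, centers=[[0, 0], [10, 0], [0, 10]],
                       cluster_std=1.0, random_state=0)

y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Rows: real classes, columns: estimated clusters
table = pd.crosstab(pd.Series(y_true, name="True class"),
                    pd.Series(y_pred, name="Cluster"))

# 1.0 means identical partitions; values near 0 mean chance-level agreement
ari = adjusted_rand_score(y_true, y_pred)

print(table)
print(ari)
```

In the table, a good clustering shows one dominant cell per row, even when the cluster numbers differ from the class numbers.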

# Selecting K: elbow plot

Fit k-means for a range of K values and record the sum of squared errors (inertia) of each fit

In [None]:
sse = []                 # Sum of squared errors (inertia) for each K
Krange = range(1,10)
for k in Krange:
    km = KMeans(n_clusters = k)
    km.fit(df[['PetalLength','PetalWidth']])   # Same column names as in the cells above
    sse.append(km.inertia_)

In [None]:
plt.plot(Krange,sse)
plt.title('Elbow plot')
plt.xlabel('K')
plt.ylabel('Sum of squared errors')
plt.show()

<span style='color:Blue'> **Important remarks**  </span>

According to the elbow plot, the selected K agrees with the real number of clusters.
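Reading the elbow can be made less subjective by looking at how much the error falls with each extra cluster. This is a sketch on synthetic data with three groups (generated with `make_blobs`, an assumption for illustration); the relative drop is large up to the real number of groups and small afterwards.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=1.0, random_state=0)

k_values = list(range(1, 10))
sse = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)

# Relative drop in SSE when going from K-1 to K clusters
drops = [(sse[i - 1] - sse[i]) / sse[i - 1] for i in range(1, len(sse))]
for k, drop in zip(k_values[1:], drops):
    print(f"K={k}: SSE falls by {drop:.0%}")
```

The elbow is the last K for which the drop is still large; beyond it, extra clusters only split groups that already belong together.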



# Final remarks

- The k-means clustering algorithm is perhaps the simplest and most popular unsupervised learning algorithm

- The number of clusters has to be defined by the user (i.e., by you!)

- The number assigned to each cluster is arbitrary and taken from the set 0, 1, ..., K-1 (here 0, 1, 2)

- The cluster numbers can change from run to run and need not match the numbers of the real classes

- The **sklearn** package provides tools for data analysis, such as k-means
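To see why the algorithm is considered simple, here is a minimal NumPy sketch of the standard procedure (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. It is an illustration under simplifying assumptions (random initialization from the data, a fixed iteration cap), not sklearn's implementation, and the name `simple_kmeans` is hypothetical.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:   # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two tight groups of three points each (made-up values)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = simple_kmeans(X, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random start, which is the arbitrariness noted in the remarks above.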

# Activity: work with the iris dataset

 - For the following cases, run k-means with and without min-max scaling, and determine whether the scaling helps or not
 
 - Also, compute and show the elbow plot
    
1. Do clustering with the iris flower dataset to form clusters using petal width and length as features. Drop the other two features (sepal width and length) for simplicity.


2. Do clustering with the iris flower dataset to form clusters using sepal width and length as features. Drop the other two features (petal width and length) for simplicity.


3. Do clustering with the iris flower dataset to form clusters using sepal and petal width and length as features. Note that a single 2-D scatter plot is not possible here, since there are four features.





5. Draw conclusions:
    - About the scaling: does it help or not?
    - About the elbow plot: does the K agree with the real number of clusters?
    - Comparison between (i) sepal features alone, (ii) petal features alone, and (iii) both sepal and petal features: which one is better/worse? Why?
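As a starting point for the scaling question, min-max scaling maps each feature to the range [0, 1] via x' = (x - min) / (max - min). The sketch below uses `MinMaxScaler` on hypothetical data (not the iris dataset) in which one feature separates two classes but has a small range, while the other is unrelated noise with a large range. All names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
group = np.repeat([0, 1], 100)                    # hypothetical true classes
informative = group + rng.normal(0, 0.05, 200)    # small range, separates the classes
noisy = rng.uniform(0, 1000, 200)                 # large range, unrelated to the classes
X = np.column_stack([informative, noisy])

# Rescale every column to [0, 1]
X_scaled = MinMaxScaler().fit_transform(X)

def cluster_agreement(data):
    """Adjusted Rand index between the true classes and a 2-cluster k-means fit."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    return adjusted_rand_score(group, labels)

ari_raw = cluster_agreement(X)
ari_scaled = cluster_agreement(X_scaled)
print(ari_raw, ari_scaled)
```

Because k-means relies on Euclidean distance, the feature with the largest range dominates unless the features are brought to a common scale. Whether this matters for the iris features is what the activity asks you to check.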