# K-means clustering using the iris dataset

This notebook studies and implements k-means clustering with scikit-learn ("sklearn"). The iris dataset is used to check whether the flower classes can be recovered automatically as clusters.


## Acknowledgments

- Used dataset: https://archive.ics.uci.edu/ml/datasets/iris




## Importing libraries

In [1]:
# Import the packages that we will be using
import numpy as np                  # For arrays
import matplotlib.pyplot as plt     # For showing plots
import pandas as pd                 # For data handling
import seaborn as sns               # For advanced plotting

# Note: specific functions of the "sklearn" package will be imported when needed to show concepts easily


## Importing data

In [None]:
# Define the col names for the iris dataset
colnames = ["Sepal_Length", "Sepal_Width","Petal_Length","Petal_Width", "Flower"]

# Dataset url
#url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
url = "datasets/iris/iris.csv"

# Load the dataset from the local disk
df  = pd.read_csv(url, header = None, names = colnames )

df
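If the local CSV file is not available, a similar data frame can be built from the copy of the iris dataset bundled with scikit-learn. This is only a sketch: the name `df_alt` is hypothetical, the column names are copied from this notebook, and the bundled values may differ slightly from the UCI file.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the copy of the dataset that ships with scikit-learn (no download needed)
iris = load_iris()

# Same feature column names as used in this notebook
colnames = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"]
df_alt = pd.DataFrame(iris.data, columns=colnames)

# scikit-learn stores the classes as integers 0, 1, 2;
# map them back to the string labels used in the UCI file
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
df_alt["Flower"] = [species[t] for t in iris.target]

print(df_alt.shape)
```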


## Encoding the class label and removing one of the classes


In [None]:
# Encoding the class label categorical column: from string to num
df = df.replace({"Flower":  {"Iris-setosa":0, "Iris-versicolor":1, "Iris-virginica":2} })

# NOTE: sklearn's KMeans numbers its clusters starting from 0, so the classes are encoded from 0 as well

# Visualize the dataset
df


Now the label/category is numeric
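As a small illustration of the encoding step, the same string-to-number mapping can be applied to a single column with `Series.map`, and inverted to recover the original names. The toy `labels` series below is made up for the example.

```python
import pandas as pd

# Toy label column (made-up values, same class names as the iris file)
labels = pd.Series(["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"])

# Same encoding as in the notebook: class names to integers starting at 0
mapping = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
encoded = labels.map(mapping)

# map() yields NaN for names missing from the dictionary, so check for that
assert not encoded.isna().any()

# Invert the dictionary to go back from numbers to names
decoded = encoded.map({v: k for k, v in mapping.items()})

print(encoded.tolist())
```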

Remove all observations from one of the classes

In [None]:
# Discard the observations of one of the classes, e.g., class "setosa" (label 0), to keep only two classes in the dataset
#Label2Remove = 0 # (0,1,2)
#df = df[df.Flower!=Label2Remove]
#df

# NOTE: keep this cell commented out if you want to use the three classes


Scatter plot of the data

In [None]:
# Scatter plot of the data
plt.scatter(df.Petal_Length,df.Petal_Width)
plt.title('Petal')
plt.xlabel('Length')
plt.ylabel('Width')
plt.show()


It seems that the petal length and width form two clusters; however, we know in advance that there are three classes!

Scatter plot of the data, assigning each point to the real cluster it belongs to

In [None]:
# Get dataframes for each real cluster
df1 = df[df.Flower==0]
df2 = df[df.Flower==1]
df3 = df[df.Flower==2]

# Scatter plot of each real cluster
plt.scatter(df1.Petal_Length, df1.Petal_Width, label='Flower type 0', c='r', marker='o', s=64, alpha=0.3)
plt.scatter(df2.Petal_Length, df2.Petal_Width, label='Flower type 1', c='g', marker='o', s=64, alpha=0.3)
plt.scatter(df3.Petal_Length, df3.Petal_Width, label='Flower type 2', c='b', marker='o', s=64, alpha=0.3)

plt.title('Petal')
plt.xlabel('Length')
plt.ylabel('Width')
plt.legend()
plt.show()


Recall that for this dataset we know in advance the class to which each point belongs

## K-means clustering using sklearn

In [None]:
# Import library
from sklearn.cluster import KMeans

# Create model
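km1         = KMeans(n_clusters=3, n_init=10)

# Do K-means clustering (on the two petal features plotted above)
ypredicted1 = km1.fit_predict(df[['Petal_Length', 'Petal_Width']])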

# Print estimated cluster of each point in the dataset
ypredicted1


In [None]:
df.Flower.values

NOTE: the labels of the estimated clusters do not necessarily agree with the real labels; therefore, the labels of the real and estimated clusters must be paired before comparing them

In [None]:
# Manually pair the labels of the real and estimated clusters
ypredicted1new = np.choose(ypredicted1, [0, 2, 1]).astype(int)
ypredicted1new

# NOTE: you need to choose the correct order for your run;
# entry i of the list is the real label given to estimated cluster i
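The manual pairing can also be automated. The sketch below is one possible approach, not part of the original notebook: it builds a confusion matrix and uses the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) to find the relabelling with the largest overlap. The helper `pair_labels` and the toy arrays are hypothetical, and the sketch assumes both label arrays use the integers 0..k-1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def pair_labels(y_true, y_pred):
    """Relabel y_pred so that its cluster IDs best match the labels in y_true."""
    # Rows are real labels, columns are estimated cluster IDs
    cm = confusion_matrix(y_true, y_pred)
    # Pick one column per row so that the total overlap is maximal
    row_ind, col_ind = linear_sum_assignment(cm, maximize=True)
    # mapping[estimated ID] = real label
    mapping = np.empty(cm.shape[1], dtype=int)
    mapping[col_ind] = row_ind
    return mapping[np.asarray(y_pred)]

# Toy example: same grouping, different numbering, one point misassigned
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 2, 2, 0, 0, 0, 0])

y_paired = pair_labels(y_true, y_pred)
print(y_paired)
```

After pairing, the only remaining disagreement with `y_true` is the point that was genuinely assigned to the wrong cluster.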


In [None]:
# Add a new column to the dataset with the cluster information
df['Cluster1'] = ypredicted1new

df


In [None]:
# Print labels of the estimated clusters
df.Cluster1.unique()


In [None]:
# Cluster centroids
km1.cluster_centers_

# NOTE: centroids also need to be paired. No need to do it here


In [None]:
# Sum of squared error (sse) of the final model
km1.inertia_
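To make the meaning of `inertia_` concrete, the sketch below recomputes it by hand as the sum of squared distances from each point to the centroid of its assigned cluster. It uses the iris copy bundled with scikit-learn and a fixed `random_state`, both of which are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Petal length and petal width (columns 2 and 3 of the bundled dataset)
X = load_iris().data[:, 2:4]

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Difference between each point and the centroid of its own cluster
diffs = X - km.cluster_centers_[km.labels_]

# Sum of squared distances over all points
sse_manual = np.sum(diffs ** 2)

print(sse_manual, km.inertia_)
```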


## Plot estimated clusters


In [None]:
# Get dataframes for each estimated cluster
df1 = df[df.Cluster1==0]
df2 = df[df.Cluster1==1]
df3 = df[df.Cluster1==2]

# Scatter plot of each estimated cluster
plt.scatter(df1.Petal_Length, df1.Petal_Width, label='Cluster 0', c='r', marker='o', s=32, alpha=0.3)
plt.scatter(df2.Petal_Length, df2.Petal_Width, label='Cluster 1', c='g', marker='o', s=32, alpha=0.3)
plt.scatter(df3.Petal_Length, df3.Petal_Width, label='Cluster 2', c='b', marker='o', s=32, alpha=0.3)

plt.scatter(km1.cluster_centers_[:,0], km1.cluster_centers_[:,1], color='black', marker='*', label='Centroids', s=256)

plt.title('Petal')
plt.xlabel('Length')
plt.ylabel('Width')
plt.legend()
plt.show()


## Plot both real and estimated clusters to check for errors

In [None]:
# Get dataframes for each real cluster
df1 = df[df.Flower==0]
df2 = df[df.Flower==1]
df3 = df[df.Flower==2]

# Scatter plot of each real cluster
plt.scatter(df1.Petal_Length, df1.Petal_Width, label='Flower type 0', c='white', edgecolor='r', marker='^', s=64, alpha=0.9)
plt.scatter(df2.Petal_Length, df2.Petal_Width, label='Flower type 1', c='white', edgecolor='g', marker='<', s=64, alpha=0.9)
plt.scatter(df3.Petal_Length, df3.Petal_Width, label='Flower type 2', c='white', edgecolor='b', marker='>', s=64, alpha=0.9)

# Get dataframes for each estimated cluster
df1 = df[df.Cluster1==0]
df2 = df[df.Cluster1==1]
df3 = df[df.Cluster1==2]

# Scatter plot of each estimated cluster
plt.scatter(df1.Petal_Length, df1.Petal_Width, label='Cluster 0',      c='white', edgecolor='r', marker='^', s=16, alpha=0.9)
plt.scatter(df2.Petal_Length, df2.Petal_Width, label='Cluster 1',      c='white', edgecolor='g', marker='<', s=16, alpha=0.9)
plt.scatter(df3.Petal_Length, df3.Petal_Width, label='Cluster 2',      c='white', edgecolor='b', marker='>', s=16, alpha=0.9)

plt.title('Petal')
plt.xlabel('Length')
plt.ylabel('Width')
plt.legend()

#plt.xlim(4,6)
#plt.ylim(1,2)

plt.show()


## Compute performance
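One possible way to quantify how well the estimated clusters agree with the real classes is the adjusted Rand index, which does not depend on how the clusters are numbered. The sketch below is an illustration under assumptions (bundled iris data, petal features only, fixed `random_state`), not the notebook's own evaluation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X = iris.data[:, 2:4]        # petal length and petal width
y_true = iris.target         # real classes, encoded as 0, 1, 2

y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 1.0 means identical groupings; values near 0 mean chance-level agreement
ari = adjusted_rand_score(y_true, y_pred)

# Renumbering the clusters does not change the score
ari_permuted = adjusted_rand_score(y_true, (y_pred + 1) % 3)

print(ari, ari_permuted)
```

Because the score ignores the numbering, no label pairing is needed before computing it.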


## Selecting K: elbow plot

Fit the model for several values of k and record the sum of squared errors (SSE) of each fit

In [None]:
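# Initialize sum of squared error (sse)
sse = []

# Define values of k (assumed range; adjust as needed)
k_rng = range(1, 10)

# For each k
for k in k_rng:
    # Create model
    km = KMeans(n_clusters=k, n_init=10)
    # Do K-means clustering
    km.fit(df[['Petal_Length', 'Petal_Width']])
    # Save sse for each k
    sse.append(km.inertia_)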


In [None]:
# Plot sse versus k
plt.plot(k_rng,sse)

plt.title('Elbow plot')
plt.xlabel('K')
plt.ylabel('Sum of squared error')
plt.show()


Choose the k beyond which the SSE decreases only marginally (the "elbow" of the curve)
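The elbow can also be read from the numbers rather than the plot. The sketch below prints the relative SSE reduction obtained by each additional cluster; the range of k, the bundled iris data, and the fixed `random_state` are assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data[:, 2:4]     # petal length and petal width
k_values = list(range(1, 8))     # assumed range of k

# SSE (inertia) of the fitted model for each k
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in k_values]

# Relative reduction in SSE when going from k-1 to k clusters
for k, prev, cur in zip(k_values[1:], sse[:-1], sse[1:]):
    print(f"k={k}: SSE={cur:.2f}, reduction={(prev - cur) / prev:.1%}")
```

A large reduction followed by consistently small ones suggests stopping at the last k that produced the large drop.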

## Normalizing the data: preprocessing using min max scaler


In [None]:
# Import library
from sklearn.preprocessing import MinMaxScaler

# Initialize scaler
scaler = MinMaxScaler()


In [None]:
# Scale data
scaler.fit(df[['Petal_Length']])
df['Petal_Length_Scaled'] = scaler.transform(df[['Petal_Length']])

scaler.fit(df[['Petal_Width']])
df['Petal_Width_Scaled'] = scaler.transform(df[['Petal_Width']])

df
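With its default settings, min-max scaling maps each column to the range [0, 1] using (x - min) / (max - min). The sketch below checks this against `MinMaxScaler` on a small made-up array and shows that a single call can scale several columns at once, since each column is scaled independently.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values standing in for two feature columns
X = np.array([[1.0, 0.1],
              [4.0, 1.3],
              [6.9, 2.5]])

# Fit and transform both columns in a single call
X_scaled = MinMaxScaler().fit_transform(X)

# The same computation written out by hand, column by column
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled)
```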


In [None]:
df.describe()

In [None]:
# Scatter plot of the scaled data
plt.scatter(df.Petal_Length_Scaled,df.Petal_Width_Scaled)
plt.title('Petal')
plt.xlabel('Length')
plt.ylabel('Width')
plt.show()


In [None]:
# Create model
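km2 = KMeans(n_clusters=3, n_init=10)

# Do K-means clustering (on the scaled petal features)
ypredicted2 = km2.fit_predict(df[['Petal_Length_Scaled', 'Petal_Width_Scaled']])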

# Print estimated cluster of each scaled point in the dataset
ypredicted2


In [None]:
# Add a new column to the dataset with the cluster information
df['Cluster2'] = ypredicted2

df

In [None]:
# Get dataframes for each estimated cluster
df1 = df[df.Cluster2==0]
df2 = df[df.Cluster2==1]
df3 = df[df.Cluster2==2]

# Scatter plot of each estimated cluster
plt.scatter(df1.Petal_Length_Scaled, df1.Petal_Width_Scaled, label='Cluster 0')
plt.scatter(df2.Petal_Length_Scaled, df2.Petal_Width_Scaled, label='Cluster 1')
plt.scatter(df3.Petal_Length_Scaled, df3.Petal_Width_Scaled, label='Cluster 2')

plt.scatter(km2.cluster_centers_[:,0], km2.cluster_centers_[:,1], color='magenta', marker='*', label='Centroids', s=256)

plt.title('Petal')
plt.xlabel('Length')
plt.ylabel('Width')
plt.legend()
plt.show()


# <span style='color:Blue'> Final remarks  </span>

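- The number (label) assigned to each cluster is arbitrary and may change from run to run

- The estimated cluster labels therefore need not match the real class labels and must be paired before comparing them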

- If there is no information about the number of clusters k, then use the elbow plot method to choose the best number of clusters k



# <span style='color:Blue'> Activity  </span>


1- AAA

