# Clustering Iris Species

## Goal

In this notebook we will:
1. have a very brief introduction to Pandas
2. use Pandas to load the iris dataset
3. use our implementation of Kmeans to cluster the iris dataset into groups, and check their consistency with the original iris species
4. discuss how an appropriate number of groups for a given dataset can be determined

## A very brief introduction to Pandas

### Loading data

Read the file `Safari.csv` and save it in a variable named Safari

In [1]:
import pandas as pd
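# Use the full option names; the short forms can be ambiguous in recent pandas versions
pd.set_option('display.max_columns',50)
pd.set_option('display.max_rows',10)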
Safari = pd.read_csv("Safari.csv")

What is the number of rows and columns in this data?

In [2]:
Safari.shape

(20, 5)

To get an idea of what the data looks like, display its first and last few rows

In [3]:
Safari.head()

,type,id,food consumption,water consumption,Age
0,Elephant,E1,42,105,40
1,Elephant,E2,41,122,15
2,Tiger,T1,12,30,8
3,Tiger,T2,22,35,11
4,Tiger,T3,18,33,3


In [4]:
Safari.tail()

,type,id,food consumption,water consumption,Age
15,Lion,L3,21,52,14
16,Lion,L4,14,38,9
17,Hippopotamus,H1,42,83,31
18,Hippopotamus,H2,47,91,45
19,Hippopotamus,H3,41,76,12


In [5]:
Safari.head(17)

,type,id,food consumption,water consumption,Age
0,Elephant,E1,42,105,40
1,Elephant,E2,41,122,15
2,Tiger,T1,12,30,8
3,Tiger,T2,22,35,11
4,Tiger,T3,18,33,3
...,...,...,...,...,...
12,Zebra,Z6,6,12,15
13,Lion,L1,12,42,5
14,Lion,L2,22,61,12
15,Lion,L3,21,52,14


Print the number of unique values in each of Safari's columns

In [6]:
Safari.nunique()

type                  5
id                   20
food consumption     14
water consumption    20
Age                  15
dtype: int64

Let's examine one of the columns more closely. Specifically, how many different types of animals are included in Safari?

In [7]:
Safari['type'].nunique()

5

What are the unique types of animals included in Safari?

In [8]:
Safari['type'].unique()

array(['Elephant', 'Tiger', 'Zebra', 'Lion', 'Hippopotamus'], dtype=object)

How many rows include each of Safari's unique types?

In [9]:
Safari['type'].value_counts()

Zebra           6
Tiger           5
Lion            4
Hippopotamus    3
Elephant        2
Name: type, dtype: int64

Compute a high-level statistical description of the columns in the Safari data (mean, std etc.)

In [10]:
Safari.describe()

,food consumption,water consumption,Age
count,20.0,20.0,20.0
mean,21.05,46.85,18.05
std,13.926404,32.251112,12.002083
min,4.0,10.0,3.0
25%,11.75,23.75,10.75
50%,16.0,34.0,14.5
75%,29.0,64.75,24.5
max,47.0,122.0,45.0


Do the same for the Lion group alone

In [11]:
Lions = Safari[Safari['type']=='Lion']
Lions

,type,id,food consumption,water consumption,Age
13,Lion,L1,12,42,5
14,Lion,L2,22,61,12
15,Lion,L3,21,52,14
16,Lion,L4,14,38,9


In [12]:
Lions.describe()

,food consumption,water consumption,Age
count,4.0,4.0,4.0
mean,17.25,48.25,10.0
std,4.99166,10.340052,3.91578
min,12.0,38.0,5.0
25%,13.5,41.0,8.0
50%,17.5,47.0,10.5
75%,21.25,54.25,12.5
max,22.0,61.0,14.0


We can also achieve this using Pandas' built-in `groupby` functionality

In [None]:
gLionsSafari = Safari.groupby('type').get_group('Lion')
gLionsSafari

In [None]:
gLionsSafari.describe()
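
As a quick check that the two approaches select the same rows, the sketch below builds a small made-up table (the `toy` frame and its values are illustrative assumptions, not the contents of `Safari.csv`) and compares boolean-mask filtering with `groupby(...).get_group(...)`.

```python
import pandas as pd

# Toy stand-in for the Safari data; the values are made up for illustration
toy = pd.DataFrame({
    "type": ["Lion", "Tiger", "Lion", "Zebra"],
    "Age": [5, 8, 12, 15],
})

# Option 1: keep the rows where a boolean mask is True
lions_mask = toy[toy["type"] == "Lion"]

# Option 2: group by the column, then pull out a single group
lions_group = toy.groupby("type").get_group("Lion")

# Both selections contain the same rows, with the original row index preserved
print(lions_mask["Age"].tolist())
print(lions_group["Age"].tolist())
```

The mask is the shorter option when only one group is needed. `groupby` pays off when the same computation is repeated for every group.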

Do the same for each of the unique animal types in Safari

In [13]:
Safari.groupby('type').describe()

,Age,Age,Age,Age,Age,Age,Age,Age,food consumption,food consumption,food consumption,food consumption,food consumption,food consumption,food consumption,food consumption,water consumption,water consumption,water consumption,water consumption,water consumption,water consumption,water consumption,water consumption
type,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Elephant,2.0,27.5,17.67767,15.0,21.25,27.5,33.75,40.0,2.0,41.5,0.707107,41.0,41.25,41.5,41.75,42.0,2.0,113.5,12.020815,105.0,109.25,113.5,117.75,122.0
Hippopotamus,3.0,29.333333,16.563011,12.0,21.5,31.0,38.0,45.0,3.0,43.333333,3.21455,41.0,41.5,42.0,44.5,47.0,3.0,83.333333,7.505553,76.0,79.5,83.0,87.0,91.0
Lion,4.0,10.0,3.91578,5.0,8.0,10.5,12.5,14.0,4.0,17.25,4.99166,12.0,13.5,17.5,21.25,22.0,4.0,48.25,10.340052,38.0,41.0,47.0,54.25,61.0
Tiger,5.0,9.4,4.393177,3.0,8.0,10.0,11.0,15.0,5.0,17.8,5.848077,12.0,12.0,18.0,22.0,25.0,5.0,31.6,2.408319,29.0,30.0,31.0,33.0,35.0
Zebra,6.0,21.833333,9.724539,12.0,15.0,19.0,27.5,37.0,6.0,8.333333,3.011091,4.0,6.5,8.5,10.5,12.0,6.0,18.166667,5.946988,10.0,13.5,20.0,22.75,24.0


In [14]:
Safari.groupby('type').describe()['food consumption']

type,count,mean,std,min,25%,50%,75%,max
Elephant,2.0,41.5,0.707107,41.0,41.25,41.5,41.75,42.0
Hippopotamus,3.0,43.333333,3.21455,41.0,41.5,42.0,44.5,47.0
Lion,4.0,17.25,4.99166,12.0,13.5,17.5,21.25,22.0
Tiger,5.0,17.8,5.848077,12.0,12.0,18.0,22.0,25.0
Zebra,6.0,8.333333,3.011091,4.0,6.5,8.5,10.5,12.0


In [15]:
Safari.groupby('type').describe()['food consumption'][['min','max']]

type,min,max
Elephant,41.0,42.0
Hippopotamus,41.0,47.0
Lion,12.0,22.0
Tiger,12.0,25.0
Zebra,4.0,12.0
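
The chained `describe()` call above computes every statistic and then discards most of them. A more direct route, sketched here on made-up data (the `toy` frame is an assumption, not `Safari.csv`), is to select the column first and request only the needed aggregations with `agg`.

```python
import pandas as pd

# Toy stand-in for the Safari data; the values are made up for illustration
toy = pd.DataFrame({
    "type": ["Lion", "Lion", "Tiger", "Tiger", "Tiger"],
    "food consumption": [12, 22, 12, 25, 18],
})

# Select one column per group, then compute only the min and the max
food_range = toy.groupby("type")["food consumption"].agg(["min", "max"])
print(food_range)
```

On the real data the equivalent call would be `Safari.groupby('type')['food consumption'].agg(['min','max'])`. It returns integers rather than the floats that `describe()` produces.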


## Iris data

Read the file `iris.csv` and save it in a DataFrame named iris

In [16]:
pd.set_option('display.max_columns',50)
pd.set_option('display.max_rows',10)
iris = pd.read_csv("iris.csv")

![](iris.jpg)

How many rows and columns does this data include?

In [17]:
iris.shape

(150, 5)

To get an idea of what the data looks like, display its first and last few rows

In [18]:
iris.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [19]:
iris.tail()

,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


Print the number of unique values in each of its columns

In [20]:
iris.nunique()

sepal_length    35
sepal_width     23
petal_length    43
petal_width     22
species          3
dtype: int64

How many different species does it include?

In [22]:
iris['species'].nunique()

3

What are the unique types of species?

In [23]:
iris['species'].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

How many times does each unique species appear in the iris data?

In [26]:
iris['species'].value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

Compute a high-level statistical description of the columns in the iris data (mean, std etc.)

In [27]:
iris.describe()

,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


Do the same for the `setosa` group alone

In [29]:
setosa = iris.groupby('species').get_group('setosa')


In [30]:
setosa.describe()

,sepal_length,sepal_width,petal_length,petal_width
count,50.0,50.0,50.0,50.0
mean,5.006,3.428,1.462,0.246
std,0.35249,0.379064,0.173664,0.105386
min,4.3,2.3,1.0,0.1
25%,4.8,3.2,1.4,0.2
50%,5.0,3.4,1.5,0.2
75%,5.2,3.675,1.575,0.3
max,5.8,4.4,1.9,0.6


Do the same for each of the unique iris species in the iris dataset 
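
One possible way to do this mirrors the `Safari.groupby('type').describe()` call from the Pandas introduction. The sketch below runs on a tiny made-up table that only borrows the column names of `iris.csv` (the `toy_iris` frame and its six rows are assumptions, not the real data).

```python
import pandas as pd

# Tiny made-up table with iris-like column names (not the real dataset)
toy_iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica"],
})

# One row per species; the columns form a two-level index (feature, statistic)
per_species = toy_iris.groupby("species").describe()

# Pick a single feature and a few of its statistics
print(per_species["petal_length"][["mean", "min", "max"]])
```

With the real data loaded, the same pattern would read `iris.groupby('species').describe()`.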

## Machine Learning with the iris dataset

1. Explain how the iris data can be used for classification 
    - what are the feature columns? 
    - what is the target column?
2. Explain how the iris dataset can be used for clustering

### Clustering the iris dataset

We'll use the same setup and environment we've used to develop our BNHP Kmeans

In [31]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist as distance

np.random.seed(0)

In [32]:
def Plot2dColoredSamples(X,Y,color=list('rgbckmy'),marker=['o'],PlotMeans = False):
    plt.figure(figsize=[7,7])
    uY = np.unique(Y)
    K = len(uY)
    for k in range(0,K):
        # Samples carrying the k-th unique label (labels need not be the integers 0..K-1)
        in_k = (Y==uY[k])
        if len(marker)==1:
            plt.scatter(X[in_k,0], X[in_k,1], c=color[k], marker=marker[0], s=20)
            if PlotMeans:
                # Mark the group mean with its index k
                m = X[in_k].mean(axis=0)
                plt.scatter(m[0], m[1], c=color[k], marker='$ ' + str(k) + '$', s=200)
        else:
            plt.scatter(X[in_k,0], X[in_k,1], c=color[k], marker='$' + marker[k] + '$', s=20)
            m = X[in_k].mean(axis=0)
            plt.scatter(m[0], m[1], c=color[k], marker='$ C_{' + str(k) + '}$', s=200)
    plt.gca().set_aspect('equal', adjustable='box')
    return

In [33]:
def BNHP_KmeansWithDistortion(X, K, means=None, NumIterations = 20):
    # Init: for simplicity, we'll take the first K data points to initialize the cluster means
    # You can try replacing it with other initialization, and see what happens
    if means is None or len(means)==0:
        means = X[:K]
    # The code below implements the core logic of the Kmeans algorithm
    for _ in range(NumIterations):
        # line 1: compute the distance between each sample (rows) and every cluster center (columns)
        distances = distance(X, means)
        # line 2: set the samples of group k to be all the points closest to Ck
        labels = distances.argmin(axis=1)
        # line 3: for each k set Ck to be the mean of all points in group k
        means = [(X[labels==i].mean(axis=0)) for i in range(len(means))]

    # Distortion of cluster k: sum of the distances between its samples and its mean
    distortion = [distance(X[labels==k],means[k].reshape(1,X.shape[1])).sum() for k in range(K)]
    return means, labels, distortion
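
To see what the three steps inside the loop do, here is a single iteration traced on four made-up 2-D points. This sketch uses NumPy broadcasting in place of `cdist` (an assumption made to keep it dependency-free; both give the same Euclidean distance matrix), and the names `X_toy`, `means_toy`, `labels_toy` and `new_means` are illustrative only.

```python
import numpy as np

# Four toy points forming two obvious groups; the first K=2 points seed the means
X_toy = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 2.0], [10.0, 12.0]])
means_toy = X_toy[:2]

# Step 1: distance from every sample (rows) to every mean (columns)
dists = np.linalg.norm(X_toy[:, None, :] - means_toy[None, :, :], axis=2)

# Step 2: each sample joins the cluster of its nearest mean
labels_toy = dists.argmin(axis=1)

# Step 3: each mean moves to the average of its assigned samples
new_means = np.array([X_toy[labels_toy == k].mean(axis=0) for k in range(2)])

print(labels_toy)   # [0 1 0 1]
print(new_means)    # rows: [0, 1] and [10, 11]
```

Repeating steps 1 to 3 with `new_means` in place of `means_toy` leaves the labels unchanged here, which is what convergence looks like.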

Read the relevant values of the iris dataset into a feature matrix `X` and a target vector `Y`

In [34]:
X = iris[iris.columns[:4]].values
Y = iris['species'].values

Cluster the data into `K=3` groups using `BNHP_KmeansWithDistortion`

In [35]:
means, labels, distortions = BNHP_KmeansWithDistortion(X, 3)

For each cluster, count how many of its samples belong to each species, and associate the cluster with its most frequent species
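
One way to do this counting is a cross-tabulation of cluster labels against species. The sketch below uses made-up arrays (`toy_labels` and `toy_species` are stand-ins for `labels` and `Y`, not the actual clustering result) to show the pattern.

```python
import numpy as np
import pandas as pd

# Made-up cluster assignments and species names for nine samples
toy_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
toy_species = np.array(["setosa", "setosa", "setosa",
                        "versicolor", "versicolor", "virginica",
                        "virginica", "virginica", "versicolor"])

# Rows: clusters, columns: species, cells: number of samples
counts = pd.crosstab(toy_labels, toy_species,
                     rownames=["cluster"], colnames=["species"])

# Associate each cluster with its most frequent species
cluster_to_species = counts.idxmax(axis=1)

print(counts)
print(cluster_to_species)
```

On the notebook's variables the same call would be `pd.crosstab(labels, Y)`. Note that `idxmax` breaks ties by taking the first column, so a tied cluster deserves a manual look.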

Compute how well the clusters align with the original labels
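
One common way to quantify this is cluster purity: the fraction of samples that belong to the majority species of their cluster. This is only one possible measure and not necessarily the one intended here. The arrays below are made up for illustration and stand in for `labels` and `Y`.

```python
import numpy as np

# Made-up cluster assignments and species names for nine samples
toy_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
toy_species = np.array(["setosa", "setosa", "setosa",
                        "versicolor", "versicolor", "virginica",
                        "virginica", "virginica", "versicolor"])

matched = 0
for k in np.unique(toy_labels):
    # Count how often each species occurs inside cluster k
    _, species_counts = np.unique(toy_species[toy_labels == k], return_counts=True)
    # Samples of the majority species count as correctly grouped
    matched += species_counts.max()

purity = matched / len(toy_labels)
print(purity)  # 7 of the 9 toy samples match their cluster's majority species
```

A purity of 1.0 means every cluster contains a single species. Purity always rises as `K` grows, so it is only meaningful when comparing clusterings with the same number of clusters.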

The clusters are aligned with the underlying labels (with setosa perfectly separated from the other two species)

### A note on choosing the number of clusters

Compute the total distortion reported by our implementation of Kmeans for each of the values `K=[2,3,4,5,6,7,8,9]`, and plot it.

Does anything stand out?
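
As a hedged illustration of what such a curve can look like, the sketch below runs a compact K-means on a tiny made-up 1-D dataset with three obvious groups. The helper `toy_kmeans_distortion` is hypothetical: it imitates the first-K initialization and the sum-of-distances distortion of `BNHP_KmeansWithDistortion`, but uses NumPy broadcasting instead of `cdist` and keeps the previous mean if a cluster becomes empty.

```python
import numpy as np

def toy_kmeans_distortion(X, K, num_iterations=20):
    """Hypothetical helper: first-K initialization, returns the total distortion."""
    means = X[:K].astype(float)
    for _ in range(num_iterations):
        # Distance from every sample (rows) to every mean (columns)
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each mean to its cluster average; keep it if the cluster is empty
        means = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else means[k] for k in range(K)])
    # Total distortion: sum of distances from each sample to its assigned mean
    return np.linalg.norm(X - means[labels], axis=1).sum()

# Three tight groups near 0, 10 and 20, interleaved so the first K rows differ
X_toy = np.array([0, 10, 20, 1, 11, 21, 2, 12, 22], dtype=float).reshape(-1, 1)

Ks = list(range(2, 10))
totals = [toy_kmeans_distortion(X_toy, K) for K in Ks]
for K, total in zip(Ks, totals):
    print(K, total)
```

On this toy data the total drops sharply from 32 at `K=2` to 6 at `K=3` and then shrinks by only 1 per extra cluster. That bend (the "elbow") sits at the true number of groups. Calling `plt.plot(Ks, totals, 'o-')` would draw the curve. Real data such as iris usually gives a softer elbow.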