The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper *The use of multiple measurements in taxonomic problems* as an example of linear discriminant analysis.[[1](https://en.wikipedia.org/wiki/Iris_flower_data_set)] It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species.[2] Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".[3] Fisher's paper was published in the Annals of Eugenics, which has created controversy about the continued use of the Iris data set for teaching statistical techniques today.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

![https://miro.medium.com/max/2550/0*GVjzZeYrir0R_6-X.png](https://miro.medium.com/max/2550/0*GVjzZeYrir0R_6-X.png)

Image source: https://miro.medium.com/max/2550/0*GVjzZeYrir0R_6-X.png

In this study, we cluster the Iris data set using K-means.

[Attribute Information:](https://archive.ics.uci.edu/ml/datasets/iris)
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
   - Iris Setosa
   - Iris Versicolour
   - Iris Virginica

# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns # to plot
from sklearn.cluster import KMeans 
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

### Reading the dataset

In [None]:
iris = pd.read_csv("data/IRIS.csv")
x = iris.iloc[:, [0, 1, 2, 3]].values
iris.head()

In [None]:
x

In [None]:
iris.tail()

In [None]:
iris[5:22:2]  # slice rows as start:end:step, i.e. rows 5 to 21, every second row

In [None]:
iris.info()
iris[0:10]

In [None]:
# access a particular cell by its row label and column name
# dataframe.loc[row_index, "column_name"] avoids chained indexing
iris.loc[120, "species"]

In [None]:
# Frequency distribution of species
iris_outcome = pd.crosstab(index=iris["species"],  # Make a crosstab
                              columns="count")      # Name the count column

iris_outcome

In [None]:
# # Frequency distribution of sepal_width
# iris_outcome = pd.crosstab(index=iris["sepal_width"],  # Make a crosstab
#                               columns="count")      # Name the count column

# iris_outcome

In [None]:
# # Frequency distribution of petal_length
# iris_outcome = pd.crosstab(index=iris["petal_length"],  # Make a crosstab
#                               columns="count")      # Name the count column

# iris_outcome

In [None]:
# # Frequency distribution of petal_width
# iris_outcome = pd.crosstab(index=iris["petal_width"],  # Make a crosstab
#                               columns="count")      # Name the count column

# iris_outcome

In [None]:
iris_outcome

In [None]:
# creating three dataframes corresponding to the three flowers from main dataframe iris
iris_setosa=iris.loc[iris["species"]=="Iris-setosa"]
iris_virginica=iris.loc[iris["species"]=="Iris-virginica"]
iris_versicolor=iris.loc[iris["species"]=="Iris-versicolor"]

In [None]:
iris_setosa

## **Distribution plots**

Plot a histogram of each feature, colored by species.

In [None]:
sns.FacetGrid(iris,hue="species",height=3).map(sns.histplot,"petal_length").add_legend()
sns.FacetGrid(iris,hue="species",height=3).map(sns.histplot,"petal_width").add_legend()
sns.FacetGrid(iris,hue="species",height=3).map(sns.histplot,"sepal_length").add_legend()
sns.FacetGrid(iris,hue="species",height=3).map(sns.histplot,"sepal_width").add_legend()
plt.show()

Box plot of petal_length across all species:

In [None]:
sns.boxplot(y="petal_length",data=iris)
plt.show()

### Box plot of petal_length

In [None]:
sns.boxplot(x="species",y="petal_length",data=iris)
plt.show()

### Box plot of petal_width

In [None]:
sns.boxplot(x="species",y="petal_width",data=iris)
plt.show()

### Box plot of sepal_width

In [None]:
sns.boxplot(x="species",y="sepal_width",data=iris)
plt.show()

### Box plot of sepal_length

In [None]:
sns.boxplot(x="species",y="sepal_length",data=iris)
plt.show()

In [None]:
sns.boxplot(x="species",y="sepal_length",data=iris)
plt.show()
sns.boxplot(x="species",y="sepal_width",data=iris)
plt.show()
sns.boxplot(x="species",y="petal_length",data=iris)
plt.show()
sns.boxplot(x="species",y="petal_width",data=iris)
plt.show()

**Scatter plot**


In [None]:
sns.set_style("whitegrid")
sns.pairplot(iris,hue="species",height=3);
plt.show()

# K-Means

[K-means](https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/) is a centroid-based (distance-based) algorithm: each cluster is represented by a centroid, and every point is assigned to the cluster whose centroid is closest to it.
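
Put as a formula, K-means searches for clusters that make the within-cluster sum of squared distances (WCSS) small. The notation below is introduced here for illustration and is not from the original text: $C_j$ is the set of points assigned to cluster $j$, $\mu_j$ is its centroid, and $k$ is the number of clusters.

$$
\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$

This is the quantity that scikit-learn reports as `inertia_` after fitting a `KMeans` model.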

# How to Implement K-Means Clustering

1. Choose the number of clusters k
2. Select k random points from the data as centroids
3. Assign all the points to the closest cluster centroid
4. Recompute the centroids of newly formed clusters
5. Repeat steps 3 and 4 until the centroids stop changing or a maximum number of iterations is reached
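
The steps above can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the scikit-learn implementation used in the rest of this notebook: the function name `kmeans_sketch`, its parameters, and the toy data are all hypothetical, and the initial centroids are drawn uniformly at random rather than with `k-means++`.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Illustrative K-means loop (hypothetical helper, not sklearn's KMeans)."""
    rng = np.random.default_rng(seed)
    # Select k random points from the data as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign every point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of its points;
        # an empty cluster keeps its previous centroid
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop repeating once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (assumption for the demo): two well-separated 2-D blobs of 20 points each
toy_rng = np.random.default_rng(42)
toy_points = np.vstack([
    toy_rng.normal(0.0, 0.5, size=(20, 2)),
    toy_rng.normal(5.0, 0.5, size=(20, 2)),
])
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

In practice `sklearn.cluster.KMeans` is preferable, since it adds `k-means++` initialization and several restarts (`n_init`) on top of this basic loop.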


In [None]:
#Finding the optimum number of clusters for k-means clustering
from sklearn.cluster import KMeans
wcss = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

# Using the elbow method to determine the optimal number of clusters for k-means clustering


In [None]:
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') #within cluster sum of squares
plt.show()
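
The elbow in a WCSS curve can be hard to read by eye. One complementary check is the average silhouette score, which ranges from -1 to 1, with higher values indicating better-separated clusters. The sketch below makes two assumptions: it loads the copy of the data bundled with scikit-learn (`load_iris`) instead of `data/IRIS.csv` so that it runs on its own, and the names `features` and `silhouette_by_k` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# Assumption: use scikit-learn's bundled Iris data in place of data/IRIS.csv
features = load_iris().data

silhouette_by_k = {}
# The silhouette score needs at least 2 clusters, so start at k = 2
for k in range(2, 7):
    model = KMeans(n_clusters=k, init="k-means++", max_iter=300, n_init=10, random_state=0)
    cluster_labels = model.fit_predict(features)
    silhouette_by_k[k] = silhouette_score(features, cluster_labels)

for k, score in silhouette_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
```

The silhouette score and the elbow method do not always agree on the best k, so it is worth looking at both before settling on a value.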

# Implementing K-Means Clustering

In [None]:
kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(x)

In [None]:
#Visualising the clusters on the first two features (sepal length and sepal width)
#K-means cluster indices are arbitrary, so the clusters are labelled by index rather than by species
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s = 100, c = 'purple', label = 'Cluster 0')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s = 100, c = 'orange', label = 'Cluster 1')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 2')

#Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 100, c = 'red', label = 'Centroids')

plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
plt.legend()
plt.show()
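
K-means cluster indices are arbitrary, so it can help to check how they line up with the known species before naming the clusters. A minimal sketch, assuming the bundled `load_iris` data stands in for `data/IRIS.csv`; the names `species`, `cluster_ids` and `comparison` are illustrative:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled Iris data in place of data/IRIS.csv
data = load_iris()
features = data.data
species = data.target_names[data.target]  # species name for every sample

cluster_ids = KMeans(n_clusters=3, init="k-means++", max_iter=300,
                     n_init=10, random_state=0).fit_predict(features)

# Rows are the true species, columns are the K-means cluster indices
comparison = pd.crosstab(index=pd.Series(species, name="species"),
                         columns=pd.Series(cluster_ids, name="cluster"))
print(comparison)
```

Each row shows how the 50 samples of one species are spread over the clusters, so a species that dominates a single column can reasonably lend its name to that cluster.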

In [None]:
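# 3D scatter plot of the clusters using matplotlib
# the first three features (sepal length, sepal width, petal length) are used as the x, y and z axes

fig = plt.figure(figsize = (15,15))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], x[y_kmeans == 0, 2], s = 100, c = 'purple', label = 'Cluster 0')
ax.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], x[y_kmeans == 1, 2], s = 100, c = 'orange', label = 'Cluster 1')
ax.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], x[y_kmeans == 2, 2], s = 100, c = 'green', label = 'Cluster 2')

#Plotting the centroids of the clusters
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:, 2], s = 100, c = 'red', label = 'Centroids')

ax.set_xlabel('sepal length (cm)')
ax.set_ylabel('sepal width (cm)')
ax.set_zlabel('petal length (cm)')
ax.legend()
plt.show()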