# The Iris Dataset



## Introduction

Originally introduced by the biologist and statistician Ronald Fisher in 1936, and now hosted at the UCI Machine Learning Repository, this dataset is widely considered the Hello World of machine learning datasets. It is popular in teaching and is often used to introduce people to machine learning and to dataset visualization.

In 1936 botanist Edgar Anderson was studying the structural variation of three related species of Iris flowers. He collected 50 samples from each species, making 150 in total. For each sample he measured the length and width of the sepal and of the petal.

The three species of Iris collected were _Iris setosa_, _Iris virginica_ and _Iris versicolor_.

![image of an Iris flower depicting the petal and sepal length, width](./images/iris_petal_sepal.png)


**The dataset contains:**

* 3 classes of 50 instances each (the class denoting the species of Iris)
* Sepal and petal length and width, measured in centimetres, for every sample

In this notebook I will explain the dataset and show how to write a supervised learning algorithm that separates the three classes of iris based on their characteristics.

The question that we want to answer is _"Can we predict the species of Iris flower using only measurements?"_

## Inspecting the Dataset

Our first step is to load and inspect the dataset so we can understand its layout and confirm that the description above holds.

Let's import the dataset as a pandas DataFrame and look at the 150 entries.

In [18]:
# Import pandas for data processing
import pandas as pd

# Load the dataset
irisDataset = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")

# Display the 150 entries
irisDataset

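| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |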
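
Printing every row is rarely necessary for a first look. As a minimal sketch, `head` and `tail` return just the first or last few rows of a DataFrame. `irisSample` below is a hypothetical stand-in, typed by hand from the first five rows shown above, so the block runs without downloading anything.

```python
import pandas as pd

# Hypothetical stand-in for irisDataset: the first five rows shown above,
# typed by hand so the example does not need a network connection.
irisSample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
    "species": ["setosa"] * 5,
})

# head(n) returns the first n rows, tail(n) returns the last n rows
firstRows = irisSample.head(3)
lastRows = irisSample.tail(2)

print(firstRows)
print(lastRows)
```

The same two calls should work unchanged on the full `irisDataset`.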


In [19]:
# Dataset shape
print(irisDataset.shape)

(150, 5)


- - -

Each row represents a sample and each column represents a feature. The first four columns list the sepal and petal length and width in centimetres, while the fifth column identifies the species.

Rows and columns are sometimes referred to by other names, which are listed below for reference:

* A row is an observation, sample, example, instance, record
* A column is a feature, predictor, attribute, independent variable, input, regressor, covariate
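
In a supervised setting the four measurement columns serve as the inputs, while the species column is the label to predict (often called the target or response). The sketch below shows one common way to separate the two. `irisSample`, `featureColumns`, `X` and `y` are names chosen for illustration, and the rows are copied by hand from the output above.

```python
import pandas as pd

# Hypothetical stand-in for irisDataset: four rows copied by hand
irisSample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6],
    "sepal_width": [3.5, 3.0, 3.2, 3.1],
    "petal_length": [1.4, 1.4, 1.3, 1.5],
    "petal_width": [0.2, 0.2, 0.2, 0.2],
    "species": ["setosa"] * 4,
})

# Columns used as inputs (features)
featureColumns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

X = irisSample[featureColumns]  # one row per sample, one column per feature
y = irisSample["species"]       # one label per sample

print(X.shape)
print(y.shape)
```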

- - - 

To verify that we do indeed have 50 samples of each species, let's count them with the pandas `value_counts` method.

In [24]:
irisDataset['species'].value_counts()

virginica     50
setosa        50
versicolor    50
Name: species, dtype: int64
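
Equal counts mean the classes are balanced. As a small sketch, `value_counts(normalize=True)` reports proportions instead of counts, and comparing the counts gives a quick balance check. `speciesSample` is a hypothetical, hand-typed label column used only so the block is self-contained. The real column has 50 of each label.

```python
import pandas as pd

# Hypothetical label column typed by hand for illustration
speciesSample = pd.Series(
    ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    name="species",
)

# Absolute counts per class
counts = speciesSample.value_counts()

# Share of each class (the values sum to 1)
proportions = speciesSample.value_counts(normalize=True)

# The classes are balanced when every count is the same
isBalanced = counts.nunique() == 1

print(counts)
print(proportions)
print(isBalanced)
```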

Additionally, let's check the type of each column so we know what we will be handling.

In [21]:
# Check data type of each of the columns
irisDataset.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
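
The four measurements are floats, while `species` holds text labels. Many learning algorithms expect numeric labels, and one possible way to get them (a sketch, not the only option) is to convert the column to the pandas `category` dtype and read its integer codes. `speciesSample` is again a hypothetical, hand-typed column.

```python
import pandas as pd

# Hypothetical label column typed by hand for illustration
speciesSample = pd.Series(
    ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    name="species",
)

# Convert the text labels to a categorical column
speciesCategory = speciesSample.astype("category")

# The distinct labels, in the order pandas assigned them
print(speciesCategory.cat.categories)

# One integer code per row, indexing into the categories above
print(speciesCategory.cat.codes)
```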

- - -

We can also get a concise overview of the DataFrame (index range, non-null counts, column types and memory usage) with the `info` method.

In [23]:
# info() prints its summary directly and returns None, so no print() is needed
irisDataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
species         150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
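
The `150 non-null` entries indicate that no column has missing values. A more direct check, sketched below, counts missing values per column with `isna().sum()`. `irisSample` is a hypothetical stand-in whose third row has a deliberately missing value that is not present in the real data, so that the check has something to find.

```python
import pandas as pd

# Hypothetical stand-in: the None in sepal_length is inserted on purpose
# and does NOT occur in the real Iris dataset.
irisSample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, None],
    "sepal_width": [3.5, 3.0, 3.2],
    "species": ["setosa", "setosa", "setosa"],
})

# Number of missing values in each column
missingPerColumn = irisSample.isna().sum()

# Total number of missing values in the whole frame
totalMissing = int(missingPerColumn.sum())

print(missingPerColumn)
print(totalMissing)
```

On the full `irisDataset`, every count would be expected to be zero, in line with the `info` output above.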


- - -

Let's summarize each numeric attribute of the dataset with the `describe` method. This lets us view the central tendency, dispersion and shape of each column's distribution.

In [27]:
irisDataset.describe() 

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
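
The summary above pools all three species together. Since our question is about telling the species apart, a per-species summary may be more informative. The sketch below groups by `species` before aggregating. `irisSample` is a hypothetical stand-in with a handful of hand-typed rows for illustration, so its numbers are not the statistics of the full dataset.

```python
import pandas as pd

# Hypothetical stand-in: two hand-typed rows per species, for illustration only
irisSample = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Mean, minimum and maximum petal length within each species
perSpecies = irisSample.groupby("species")["petal_length"].agg(["mean", "min", "max"])

print(perSpecies)
```

On the full data, `irisDataset.groupby("species").describe()` should give the complete set of statistics for every column and species.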


## Visualizing the Dataset

## References

https://www.ritchieng.com/machine-learning-iris-dataset/

https://www.youtube.com/watch?v=hd1W4CyPX58

https://www.kaggle.com/adityabhat24/iris-data-analysis-and-machine-learning-python