# Background

In this notebook I explore the Iris dataset, one of the best-known datasets in pattern recognition. It is available from the UCI Machine Learning Repository, a widely used collection of datasets for machine learning. The dataset was introduced by the British statistician and biologist Ronald Fisher. In 1936 Fisher used the Iris flower data set as an example of discriminant analysis, which he proposed as a method for predicting qualitative values. He used it to distinguish the species of Iris flowers from each other using a combination of the four measurement variables in the data set.
![alt text](https://www.bing.com/images/search?view=detailV2&ccid=NpzNod9D&id=432D817D4B669C5F2E730A42EB5668AA05E7873E&thid=OIP.NpzNod9DFIWoYNp0YpfD_wHaLd&mediaurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.369ccda1df431485a860da746297c3ff%3frik%3dPofnBapoVutCCg%26riu%3dhttp%253a%252f%252fwww.42evolution.org%252fwp-content%252fuploads%252f2014%252f07%252fRonald-Fisher-from-Royal-Society.jpg%26ehk%3ddEZNCo3gJ0op2eBgYxl%252bZVaXpdcNTKP9XEZ3x3JYqyg%253d%26risl%3d%26pid%3dImgRaw%26r%3d0&exph=3331&expw=2154&q=ronald+fisher&simid=607995768591287951&FORM=IRPRST&ck=E18E40B2DE221E156B7AD7AE782C90B0&selectedIndex=0 "Ronald fisher")

While Fisher used the Iris data set to demonstrate statistical methods of classification, the data itself was collected by Edgar Anderson, an American botanist and geneticist. Fisher's Iris data set is therefore also known as Anderson's Iris data set. Anderson was particularly interested in variation within a plant species or group of species, and in evolution in general. He carefully examined the individual characters of iris plants growing in different conditions. He used scatter diagrams and ideographs (simplified diagrams that he developed himself) to visualise and compare the data more easily. These methods helped him to draw conclusions about the data.





# Contents of the Iris dataset

The Iris dataset contains three classes, each referring to a species of iris plant. There are 50 instances of each class, giving 150 instances in total.

Each instance has four attributes, or variables, which are:

1. Sepal length in centimetres
2. Sepal width in centimetres
3. Petal length in centimetres
4. Petal width in centimetres

The three classes are:
1. Iris Setosa
2. Iris Versicolor
3. Iris Virginica
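
As a rough illustration of the layout described above, the sketch below parses a few comma-separated rows into records with four measurements and a class label, then counts the rows per class. The four rows are copied from the data set; the names `SAMPLE`, `FIELDS` and `parse_rows` are made up for this example and are not part of the notebook, which loads the full file with pandas instead.

```python
import csv
import io
from collections import Counter

# A handful of rows in the same comma-separated layout as iris.data
# (no header row: four measurements in cm, then the class label).
SAMPLE = """\
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
6.7,3.0,5.2,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
"""

# Column names chosen for this sketch (hypothetical helper names).
FIELDS = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Class"]


def parse_rows(text):
    """Turn comma-separated text into a list of dicts, one per flower."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines
            continue
        *measurements, label = record
        row = dict(zip(FIELDS[:4], map(float, measurements)))
        row["Class"] = label
        rows.append(row)
    return rows


rows = parse_rows(SAMPLE)
class_counts = Counter(row["Class"] for row in rows)
print(class_counts)
```

On the full data set the same count should come to 50 rows for each of the three classes.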

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from tabulate import tabulate
import seaborn as sns
import sys

csv_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

cols_names = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Class"]

iris_data = pd.read_csv(csv_url, names=cols_names)

# First look at the data: the whole frame (pandas elides the middle rows),
# then the first five, last five and five random rows.
print(iris_data)
print(f"\n{iris_data.head(5)}")
print(iris_data.tail(5))
print(iris_data.sample(5))


# Mean and median of each measurement per class.
# Printed explicitly, since a cell only displays its last bare expression.
print(iris_data.groupby("Class").mean())
print(iris_data.groupby("Class").median())

# Mean of each measurement over the whole data set.
# numeric_only=True skips the text "Class" column.
print(iris_data.mean(numeric_only=True))

# Summary statistics for each class of Iris in the data set,
# transposed to make them easier to read.
print("Summary statistics for each class of Iris in the data set\n")
print(iris_data.groupby("Class").describe().T)

     Sepal_Length  Sepal_Width  Petal_Length  Petal_Width           Class
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

[150 rows x 5 columns]

   Sepal_Length  Sepal_Width  Petal_Length  Petal_Widt

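
The per-class summaries requested in the cell above follow a split, apply, combine pattern: rows are split by class, a statistic is computed for each group, and the results are combined into one table. A minimal sketch of that idea in plain Python is given below, assuming only the petal lengths of the ten rows visible in the printed frame (the first five and last five rows), so the numbers are not the statistics of the full classes. The variable names are illustrative and not part of the notebook.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# (class, petal length in cm) for the ten rows visible in the printed frame.
rows = [
    ("Iris-setosa", 1.4),
    ("Iris-setosa", 1.4),
    ("Iris-setosa", 1.3),
    ("Iris-setosa", 1.5),
    ("Iris-setosa", 1.4),
    ("Iris-virginica", 5.2),
    ("Iris-virginica", 5.0),
    ("Iris-virginica", 5.2),
    ("Iris-virginica", 5.4),
    ("Iris-virginica", 5.1),
]

# Split: collect the values that belong to each class.
groups = defaultdict(list)
for label, petal_length in rows:
    groups[label].append(petal_length)

# Apply and combine: one record of statistics per class.
summary = {
    label: {
        "mean": mean(values),
        "median": median(values),
        "std": stdev(values),  # sample standard deviation (divides by n - 1)
        "min": min(values),
        "max": max(values),
    }
    for label, values in groups.items()
}

for label, stats in summary.items():
    print(label, {name: round(value, 3) for name, value in stats.items()})
```

`statistics.stdev` divides by n - 1, which is also the default for the pandas `std` aggregation, so the two should agree when given the same values.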


In [11]:
print("A table of statistics for each variable can be seen below.")
print("The first table displays statistics of the data set as a whole, while Tables 2,3,4 and 5")
print("display statistics such as mean, min and max for each of the 4 variables when grouped together by class.\n")

measurements = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"]

# Table 1: statistics over the whole data set, one column per measurement.
print("Table 1 - Iris Dataset Statistics")
iris_stats = iris_data[measurements].agg(['mean', 'min', 'max', 'std'])
print(tabulate(iris_stats, headers=["Stat"] + [f"{m} (cm)" for m in measurements], tablefmt='grid', stralign='center'))
print('\n')

# Tables 2 to 5: mean, min and max of one measurement, grouped by class.
# The Class index is kept so that tabulate prints it as the first column.
for number, column in enumerate(measurements, start=2):
    print(f"Table {number} - Iris Dataset {column.replace('_', ' ')} Statistics")
    table_of_data = iris_data.groupby('Class')[column].agg(['mean', 'min', 'max'])
    print(tabulate(table_of_data, headers=["Class", "Mean (cm)", "Min (cm)", "Max (cm)"], tablefmt='grid', stralign='center'))
    print('\n')



A table of statistics for each variable can be seen below.
The first table displays statistics of the data set as a whole, while Tables 2,3,4 and 5
display statistics such as mean, min and max for each of the 4 variables when grouped together by class.

Table 1 - Iris Dataset Statistics
+--------+---------------------+--------------------+---------------------+--------------------+
|  Stat  |   Sepal_Length (cm) |   Sepal_Width (cm) |   Petal_Length (cm) |   Petal_Width (cm) |
+========+=====================+====================+=====================+====================+
|  mean  |            5.84333  |           3.054    |             3.75867 |           1.19867  |
+--------+---------------------+--------------------+---------------------+--------------------+
|  min   |            4.3      |           2        |             1       |           0.1      |
+--------+---------------------+--------------------+---------------------+--------------------+
|  max   |            7.9      |           4.4      |             6.9     |           2.5      |
+--------+---------------------+-