# Introduction
This documentation covers what Fisher's Iris Dataset is, why it is used in data analysis, methods for analysing data in Python, and examples of how these methods can be applied to Fisher's Iris Dataset.

# The Fisher's Iris Dataset

## What is Fisher's Iris Dataset?
Fisher's Iris dataset is a simple dataset of 150 records: 50 records for each of 3 different types of iris. There are 5 variables in the dataset:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
   - Iris Setosa
   - Iris Versicolour
   - Iris Virginica

The data was collected by Edgar Anderson, who was trying to work out how one species of iris evolved from another. He chose to study Iris Versicolour, but subsequently discovered that this was actually two species: Iris Versicolour and Iris Virginica. He then added Iris Setosa to his investigation. From this work he concluded that the Iris Versicolour found in north-eastern America arose as a hybrid of Iris Virginica and Iris Setosa.

The dataset was published by Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems [1]. By analysing the data, Fisher showed that Iris Setosa could be separated from Iris Versicolour and Iris Virginica. In this project we use Python to create visual representations of the data that confirm this separation. The dataset was donated in 1988 by Michael Marshall and is publicly available from the UCI Machine Learning Repository [2].

## Where can it be located?

## What are its uses?

# Analysing data with Python

## Working with numerical data: NumPy
NumPy is a library designed for working with numerical data. You first need to ensure it is installed on your computer and then import it in your code, normally under the alias `np`.

In [8]:
import numpy as np
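
As a minimal sketch of what NumPy offers, the following computes a few summary statistics for a small array. The values are illustrative measurements in cm that have been typed in by hand; they are not read from the dataset file.

```python
import numpy as np

# Illustrative sepal lengths in cm (hand-typed, not loaded from the dataset file).
sepal_lengths = np.array([5.1, 4.9, 4.7, 4.6, 5.0])

mean_length = sepal_lengths.mean()        # arithmetic mean
std_length = sepal_lengths.std(ddof=1)    # sample standard deviation
longest = sepal_lengths.max()             # largest value in the array

print(f"mean={mean_length:.2f} std={std_length:.2f} max={longest}")
```

Operations such as `mean()` act on the whole array at once, so no explicit loop is needed.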

## Working with datasets: Pandas
The most commonly used tool in Python for working with datasets is Pandas [3]. It is a Python library with functions for analysing, cleaning, exploring and manipulating data. The codebase for Pandas is available at: https://github.com/pandas-dev/pandas.

Before Pandas can be used it needs to be installed. If you already have Spyder or Anaconda installed, Pandas is already included and you don't need to install it again.

Once it is installed, it is made available in your code by importing it, normally under the alias `pd`:

In [13]:
import pandas as pd


An easy way to check that it is installed is to print the version number to the terminal:

In [14]:
print(pd.__version__)

1.4.4


Pandas allows tabulated data such as CSV to be imported as a data table, which Pandas calls a `DataFrame`. Data can be imported from a variety of file formats or data sources using functions named `read_` followed by the source type, eg `read_csv` reads in data from a CSV file. By default the first line of the file is assumed to be the header row. If the file does not have headers, these can be added by using `names = [" ", " "]`.

In [16]:
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", 
                   names = ["sepal length", "sepal width", "petal length", "petal width", "class"])

A quick view of the data can be seen by printing `head()` or `info()`. Here I have used `info()` to confirm the column names, the number of entries in each column and the datatype.

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal length  150 non-null    float64
 1   sepal width   150 non-null    float64
 2   petal length  150 non-null    float64
 3   petal width   150 non-null    float64
 4   class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


 

When the data is imported you can filter on a condition, or select by row and/or column. 
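
The filtering and selection described above can be sketched as follows. So that the example runs offline, it reads a few hand-typed rows laid out like `iris.data` instead of the full file; the variable names are chosen for illustration only.

```python
import io
import pandas as pd

# A few hand-typed rows in the same layout as iris.data, so the sketch runs offline.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.4,3.2,4.5,1.5,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
    "5.8,2.7,5.1,1.9,Iris-virginica\n"
)
names = ["sepal length", "sepal width", "petal length", "petal width", "class"]
df = pd.read_csv(sample, header=None, names=names)

# Filter on a condition: rows whose petal length is greater than 4.5 cm.
long_petals = df[df["petal length"] > 4.5]

# Select one column (a Series) or several columns (a DataFrame).
widths = df["sepal width"]
petals = df[["petal length", "petal width"]]

# Select rows by position with iloc, or rows and a column together with loc.
first_two = df.iloc[0:2]
setosa_sepal = df.loc[df["class"] == "Iris-setosa", "sepal length"]

print(long_petals)
print(setosa_sepal.tolist())
```

`loc` takes a row condition and a column label together, which is often the most convenient way to combine a filter with a column selection.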

## Plotting data
Data visualisation helps make complex data sources easier to understand through graphics. Matplotlib and Seaborn are two of the most commonly used data visualisation tools in Python.
### Matplotlib
Matplotlib is a library for plotting data. It works alongside Pandas and NumPy. Figures can be interactive, and plots can be formatted and exported in a variety of formats. Matplotlib's `pyplot` module is normally imported under the alias `plt`.

In [10]:
import matplotlib.pyplot as plt
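
A minimal sketch of a formatted plot exported to a file is given here. The measurements are illustrative values typed in by hand, and the output file name is a placeholder chosen for this example.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt

# Illustrative petal measurements in cm (hand-typed, not loaded from the dataset file).
petal_length = [1.4, 1.4, 4.7, 4.5, 6.0, 5.1]
petal_width = [0.2, 0.2, 1.4, 1.5, 2.5, 1.9]

fig, ax = plt.subplots()
ax.scatter(petal_length, petal_width)
ax.set_xlabel("Petal length (cm)")
ax.set_ylabel("Petal width (cm)")
ax.set_title("Petal width against petal length")

# Placeholder file name; the extension decides the export format.
out = Path("petal_scatter_example.png")
fig.savefig(out)
plt.close(fig)  # release the figure once it has been saved
```

Changing the extension in the file name, for example to `.pdf` or `.svg`, exports the same figure in a different format.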

### Seaborn
Seaborn is also a Python plotting library. It is built on Matplotlib and can make more complicated plots.

### Matplotlib or Seaborn: Which is better?
As is often the case with this question, the answer depends on what you are trying to do. Matplotlib is good at making basic graphs, including bar charts, pie charts and scatter plots. Seaborn is built on top of Matplotlib and works with Pandas and NumPy to make visualisations. The following table is based on the one at: https://www.geeksforgeeks.org/difference-between-matplotlib-vs-seaborn/.

|Area |Matplotlib |Seaborn |
|--|-----------|---------|
|Function|Basic charts, including bar charts, pie charts and scatter plots|More complicated plots. It can show all of the data in one plot.|
|Syntax|Quite lengthy, eg `matplotlib.pyplot.hist(values)`|A little simpler and easier to learn, eg `seaborn.histplot(values)`|
|Multiple plots|Can open and use multiple figures at a time. Figures need to be closed explicitly.|Sets the time for the creation of each figure, which can lead to out-of-memory errors.|
|Visualisation|Provides output and syntax similar to MATLAB, so it suits those already familiar with MATLAB.|More comfortable handling Pandas DataFrames. A great variety of attractive visualisations can be created.|
|Data frames and arrays|Works efficiently with data frames and arrays. `plot()` can be called without parameters.|The whole dataset is treated as a single unit, and parameters are needed when calling `plot()`.|

# How to run the code

## Importing the libraries 
Fisher's Iris dataset is published at https://archive.ics.uci.edu/ml/datasets/iris. The CSV file for the data is at https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data and the details of the column headers are in https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names. Because the data file itself has no header row, the column names are supplied with `names=` when importing the dataset. Pandas then treats the first row as data rather than as the header, which is the same behaviour as passing `header=None`.

In [11]:
import pandas as pd

# Column names taken from iris.names; the data file itself has no header row.
colname = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width", "Class"]
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                   header=None, names=colname)

In [12]:
iris.shape

(150, 5)

# Analysing Fisher's Iris Dataset
## Summary text file
A summary of each variable is to be written to a single text file.
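
One possible way to produce such a file is sketched here, using `describe()` for the numeric variables and `value_counts()` for the class. To stay runnable offline it uses a few hand-typed rows laid out like `iris.data`, and the output file name is a placeholder.

```python
import io
from pathlib import Path

import pandas as pd

# A few hand-typed rows in the same layout as iris.data, so the sketch runs offline.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.4,3.2,4.5,1.5,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
    "5.8,2.7,5.1,1.9,Iris-virginica\n"
)
colname = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width", "Class"]
iris = pd.read_csv(sample, header=None, names=colname)

summary_path = Path("summary_example.txt")  # placeholder file name
with summary_path.open("w") as f:
    f.write("Summary of each numeric variable\n")
    f.write(iris.describe().to_string())  # count, mean, std, min, quartiles, max
    f.write("\n\nNumber of records per class\n")
    f.write(iris["Class"].value_counts().to_string())
    f.write("\n")
```

With the full dataset loaded into `iris`, the same writing code applies unchanged.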

## Histogram of each variable
A histogram of each variable is to be saved to the repository.
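
A sketch of how one histogram per numeric variable could be saved as a PNG file is shown here. It uses a few hand-typed rows laid out like `iris.data` so that it runs offline; the file naming scheme and the number of bins are assumptions made for the example.

```python
import io
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# A few hand-typed rows in the same layout as iris.data, so the sketch runs offline.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.4,3.2,4.5,1.5,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
    "5.8,2.7,5.1,1.9,Iris-virginica\n"
)
colname = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width", "Class"]
iris = pd.read_csv(sample, header=None, names=colname)

saved = []
# Loop over the numeric columns only; the class column is skipped.
for column in iris.select_dtypes(include="number").columns:
    fig, ax = plt.subplots()
    ax.hist(iris[column], bins=5)
    ax.set_xlabel(f"{column} (cm)")
    ax.set_ylabel("Frequency")
    ax.set_title(f"Histogram of {column}")
    # eg "Sepal Length" becomes hist_sepal_length.png
    out = Path(f"hist_{column.lower().replace(' ', '_')}.png")
    fig.savefig(out)
    plt.close(fig)  # close each figure so they do not accumulate in memory
    saved.append(out)

print([p.name for p in saved])
```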

## Scatterplots of pairs of variables
Scatterplots of each pair of variables are to be saved in the repository.
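
One way to plot every pair of variables is sketched here, using `itertools.combinations` to list the pairs and one colour per class. It uses a few hand-typed rows laid out like `iris.data` so that it runs offline; the grid layout and file name are assumptions made for the example.

```python
import io
from itertools import combinations
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# A few hand-typed rows in the same layout as iris.data, so the sketch runs offline.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.4,3.2,4.5,1.5,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
    "5.8,2.7,5.1,1.9,Iris-virginica\n"
)
colname = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width", "Class"]
iris = pd.read_csv(sample, header=None, names=colname)

numeric = colname[:-1]                   # the four measurement columns
pairs = list(combinations(numeric, 2))   # every unordered pair of variables

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for ax, (x_col, y_col) in zip(axes.flat, pairs):
    # One scatter call per class, so each class gets its own colour.
    for species, group in iris.groupby("Class"):
        ax.scatter(group[x_col], group[y_col], label=species)
    ax.set_xlabel(f"{x_col} (cm)")
    ax.set_ylabel(f"{y_col} (cm)")
axes[0, 0].legend()

fig.tight_layout()
out = Path("scatter_pairs_example.png")  # placeholder file name
fig.savefig(out)
plt.close(fig)
```

Colouring the points by class is what makes the separation of Iris Setosa from the other two classes visible in plots of this kind.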

## Examples of analyses by others
Examples of interesting analyses that others have pursued based on the dataset will be discussed.

# References
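[1] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.

[2] UCI Machine Learning Repository, Iris dataset: https://archive.ics.uci.edu/ml/datasets/iris

[3] Pandas source code: https://github.com/pandas-dev/pandas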