# Fundamentals of Data Analysis Project Notebook - Iris Dataset Analysis
Author - Sean Humphreys

---

### Contents

1. [Problem Statement](#problem-statement)
2. [Background](#background)
3. [Pandas](#pandas)
4. [Background Reading](#background-reading)

---

## Problem Statement <a id="problem-statement"></a>

- Create a notebook investigating the variables and data points within the well-known iris flower data set associated with Ronald A. Fisher.
- In the notebook, you should discuss the classification of each variable within the data set according to common variable types and scales of measurement in mathematics, statistics, and Python.
- Select, demonstrate, and explain the most appropriate summary statistics to describe each variable.
- Select, demonstrate, and explain the most appropriate plot(s) for each variable.
- The notebook should follow a cohesive narrative about the data set.

---

## Background <a id="background"></a>

The [Iris Data Set](https://en.wikipedia.org/wiki/Iris_flower_data_set) consists of 50 samples from each of
three species of Iris (150 samples in total): *Iris setosa*, *Iris virginica* and *Iris versicolor*. Each sample contains 4 measurements, recorded in centimetres:
1. Petal width
2. Petal length
3. Sepal width
4. Sepal length

This data set is an example of a multivariate data set and was popularised by the statistician and biologist [Sir Ronald
Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher) in his 1936 paper
[*"The use of multiple measurements in taxonomic problems"*](https://digital.library.adelaide.edu.au/dspace/bitstream/2440/15227/1/138.pdf).
The data was collected by [Dr. Edgar Anderson](https://en.wikipedia.org/wiki/Edgar_Anderson) on the Gaspé Peninsula
in Canada. Two of the three species were collected from the same meadow, by the same person, using standard equipment,
in order to minimise the risk of variation in the samples arising from the way in which they were collected and
measured. Dr. Anderson is recognised as a significant contributor to the field of botanical genetics.

As can be seen from the picture below, the three species are similar in appearance. Fisher's analysis of the data set
enabled accurate classification of the species from petal and sepal measurements and, as a result, the data set is
routinely used as a beginner's data set for machine learning.

![image 1](images/illustrations/Iris_Image.png "Iris Species")

---

## Pandas <a id="pandas"></a>

Using *Python*, import the Iris dataset.

In [1]:
from ucimlrepo import fetch_ucirepo

# fetch the Iris dataset from the UCI Machine Learning Repository (dataset id 53)
iris = fetch_ucirepo(id=53)

[Pandas](https://pandas.pydata.org/) is an open-source software library for data analysis and manipulation, built on top of the *Python* programming language. The DataFrame is the primary Pandas data structure: a two-dimensional, dictionary-like container in which each column is a Series object. Below, the features and targets of the Iris dataset are assigned to two variables, each holding a Pandas DataFrame.
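
As a small illustration of the dictionary-like behaviour described above, the sketch below builds a DataFrame from two Series and reads a column back by its name. The values are the first three sepal measurements of the Iris data; the variable names (`sepal_length`, `sepal_width`, `demo_df`) are chosen for illustration only and are not part of the notebook's own code.

```python
import pandas as pd

# Two Series sharing the same default index (illustrative names)
sepal_length = pd.Series([5.1, 4.9, 4.7], name="sepal_length_cm")
sepal_width = pd.Series([3.5, 3.0, 3.2], name="sepal_width_cm")

# A DataFrame can be built from a dictionary that maps column names to Series
demo_df = pd.DataFrame({"sepal_length_cm": sepal_length, "sepal_width_cm": sepal_width})

# Indexing with a column name returns that column as a Series, much like a dictionary lookup
column = demo_df["sepal_length_cm"]
print(type(column))

# keys() returns the column labels, mirroring the dictionary interface
print(list(demo_df.keys()))
```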

In [3]:
# data (as pandas DataFrames)
# the features are copied so that columns can be added later without modifying the object held by iris.data
iris_ds_features = iris.data.features.copy()
iris_ds_targets = iris.data.targets

To confirm the DataFrame data structure, the built-in *type* function in Python is used. 

In [4]:
print(type(iris_ds_features))
print(type(iris_ds_targets))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
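
An equivalent check can be written with the built-in `isinstance` function, which returns a boolean instead of printing the class name. The sketch below uses a small stand-in DataFrame (`demo_df`, a hypothetical name) so that it runs without downloading the dataset.

```python
import pandas as pd

# Stand-in DataFrame; in the notebook this would be iris_ds_features
demo_df = pd.DataFrame({"sepal_length_cm": [5.1, 4.9]})

# isinstance returns True when the object is a DataFrame (or a subclass of it)
is_dataframe = isinstance(demo_df, pd.DataFrame)

# A single column is a Series, not a DataFrame
column_is_dataframe = isinstance(demo_df["sepal_length_cm"], pd.DataFrame)

print(is_dataframe, column_is_dataframe)
```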


A benefit of using Pandas is that it makes it straightforward to tidy a dataset so that it is easier to read. In this instance Pandas is used to combine the two DataFrames into one and to rename the columns so that each measurement name includes its unit. The "*Iris-*" prefix is then removed from the species entries so that just the species name (for example *setosa*) is left as a string in the species column.

In [5]:
# add the 'class' variable from iris_ds_targets variable into the iris_ds_features Pandas DataFrame
iris_ds_features['class'] = iris_ds_targets['class']

# rename the columns of the existing DataFrame in place (rather than creating a copy)
iris_ds_features.rename(
    columns={'sepal length': 'sepal_length_cm', 'sepal width': 'sepal_width_cm',
             'petal length': 'petal_length_cm', 'petal width': 'petal_width_cm',
             'class': 'species'},
    inplace=True,
)

# remove the "Iris-" prefix from the species entries; regex=False matches it as a literal string
# https://www.statology.org/pandas-remove-characters-from-string/ [Accessed 13 Oct. 2023]
iris_ds_features['species'] = iris_ds_features['species'].str.replace('Iris-', '', regex=False)
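
The renaming and string-cleaning steps can be tried in isolation on a small stand-in DataFrame. The sketch below is not the notebook's own cell: `demo_df` and `renamed` are hypothetical names, the three rows are illustrative, and `rename` is called without `inplace=True` to show that it then returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Small stand-in for the combined DataFrame, using the original column names
demo_df = pd.DataFrame({
    "sepal length": [5.1, 7.0, 6.3],  # illustrative values
    "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Without inplace=True, rename returns a new DataFrame and leaves demo_df as it was
renamed = demo_df.rename(columns={"sepal length": "sepal_length_cm", "class": "species"})

# regex=False treats "Iris-" as a literal string rather than a regular expression
renamed["species"] = renamed["species"].str.replace("Iris-", "", regex=False)

print(renamed)
```

Returning a new DataFrame keeps the original available for comparison, at the cost of holding two copies in memory; for a data set of this size either approach is reasonable.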

Using the Pandas *head* method, the first 5 rows of the DataFrame can be viewed. From the output, 5 variables are identified within the Iris data set: four measurements and the species.

In [6]:
iris_ds_features.head()

|   | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
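
The count of variables can also be confirmed programmatically rather than by reading the table. The following sketch rebuilds the five rows shown in the `head()` output above as a stand-in DataFrame (`head_df`, a hypothetical name) and separates the numeric measurements from the non-numeric species label; it is one possible approach, not the only one.

```python
import pandas as pd

# The five rows shown in the head() output
head_df = pd.DataFrame({
    "sepal_length_cm": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width_cm": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_length_cm": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width_cm": [0.2, 0.2, 0.2, 0.2, 0.2],
    "species": ["setosa"] * 5,
})

# shape is (rows, columns); the column count is the number of variables
n_rows, n_variables = head_df.shape

# Split the variables into numeric measurements and non-numeric labels
numeric_columns = head_df.select_dtypes(include="number").columns.tolist()
other_columns = head_df.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_variables)
print(numeric_columns)
print(other_columns)
```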


## Background Reading <a id="background-reading"></a>

archive.ics.uci.edu. (n.d.). UCI Machine Learning Repository. [online] Available at: https://archive.ics.uci.edu/dataset/53/iris. [Accessed 12 Oct. 2023].

Ekiz, A. (2023). Creating Table of Contents in Jupyter Notebook. [online] Medium. Available at: https://medium.com/@ahmetekiz/creating-table-of-contents-in-jupyter-notebook-52a7c696817f [Accessed 11 Oct. 2023].

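Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. [online] Available at: https://digital.library.adelaide.edu.au/dspace/bitstream/2440/15227/1/138.pdf [Accessed 11 Oct. 2023].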

Pandas (2018). Python Data Analysis Library — pandas: Python Data Analysis Library. [online] Pydata.org. Available at: https://pandas.pydata.org/. [Accessed 13 Oct. 2023].

Stack Overflow. (n.d.). python - Renaming column names in Pandas. [online] Available at: https://stackoverflow.com/questions/11346283/renaming-column-names-in-pandas. [Accessed 13 Oct. 2023].

Wikipedia. (2023). Edgar Anderson. [online] Available at: https://en.wikipedia.org/wiki/Edgar_Anderson [Accessed 11 Oct. 2023].

Wikipedia Contributors (2019). Iris flower data set. [online] Wikipedia. Available at: https://en.wikipedia.org/wiki/Iris_flower_data_set. [Accessed 11 Oct. 2023].

Wikipedia Contributors (2019). Ronald Fisher. [online] Wikipedia. Available at: https://en.wikipedia.org/wiki/Ronald_Fisher. [Accessed 11 Oct. 2023].

Zach (2022). Pandas: How to Remove Specific Characters from Strings. [online] Statology. Available at: https://www.statology.org/pandas-remove-characters-from-string/. [Accessed 13 Oct. 2023].