# Project

- The project is to create a notebook investigating the variables and
data points within the well-known iris flower data set associated
with Ronald A Fisher.
- In the notebook, you should discuss the classification of each
variable within the data set according to common variable types
and scales of measurement in mathematics, statistics, and Python.
- Select, demonstrate, and explain the most appropriate summary
statistics to describe each variable.
- Select, demonstrate, and explain the most appropriate plot(s) for
each variable.
- The notebook should follow a cohesive narrative about the data
set.

###### Source file: Iris – UCI Machine Learning Repository. Aug. 17, 2023. url: https://archive.ics.uci.edu/dataset/53/iris

## ATU - Fundamentals of Data Analysis, Winter 2023/24
##### Author: Norbert Antal

## **Investigating the Iris Dataset**

### **1. Introduction**

##### 1.1 Origins of the data
Fisher’s Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher. <br>The data were collected by Edgar Anderson to quantify the morphologic variation of three related Iris flower species. Fisher used the data to demonstrate linear discriminant analysis in his 1936 paper “The use of multiple measurements in taxonomic problems”, published in the Annals of Eugenics.
<br>Today, the dataset is often used as a standard test case for statistical classification in machine learning.<br>


##### 1.2 Contents of the Iris dataset 
The Iris dataset contains 150 samples: 50 from each of three Iris flower species (Iris setosa, Iris virginica, and Iris versicolor). Each sample has four features measured in centimetres: sepal length, sepal width, petal length and petal width. Using these four measurements, Ronald Fisher developed a linear discriminant model to differentiate between the species.<br>

###### (ref: https://en.wikipedia.org/wiki/Iris_flower_data_set)

### **Preparation**

##### Software used for this project

+ VS Code editor
+ Python version 3.9.13 with imported libraries:
  + *pandas* - for data manipulation and analysis (ref: https://en.wikipedia.org/wiki/Pandas_(software))
  
##### Data source:
Source files downloaded from https://archive.ics.uci.edu/ml/datasets/iris

### Reading in and validating data
The files 'iris.data' and 'iris.names' were copied to the project folder for easy access.
The data are analysed mainly with pandas, a popular Python data analysis library that provides user-friendly data structures and analysis tools. The comma-separated values file is read into a pandas DataFrame, a two-dimensional table with labelled columns and rows, similar to a spreadsheet. (ref: https://towardsdatascience.com/a-python-pandas-introduction-to-excel-users-1696d65604f6)

The Iris flower measurement values are imported from **iris.data**, which has no header row. The column labels are taken from the data description in **iris.names** and supplied manually, and the two are combined into a pandas DataFrame. <br>
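
Because iris.data has no header row, the labels have to be passed to pandas explicitly. The following is a minimal sketch of why that matters, not part of the original notebook: it reads the first three records of the file from an in-memory string (the variable `raw` and the use of `io.StringIO` are assumptions made for illustration) and compares the default behaviour of `pd.read_csv` with the `names=` approach.

```python
import io
import pandas as pd

# First three records of iris.data, held in memory for illustration only
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

headers = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
    "species",
]

# Default: pandas treats the first line as column labels, so one record is lost
without_names = pd.read_csv(io.StringIO(raw))

# names= supplies the labels and keeps every line as data
with_names = pd.read_csv(io.StringIO(raw), names=headers)

print(len(without_names), list(without_names.columns))
print(len(with_names), list(with_names.columns))
```

With the default call the first record is consumed as the header, which is why the notebook passes `names=headers` when loading the real file.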

In [17]:
# ------------- load modules  --------------
import pandas as pd # for data analysis and dataframe

#----read in data and add headers to each column (headers taken from iris.names)--------------
SOURCEDATA="iris.data" # path of the source file, stored in a module-level constant
headers=[ # column labels for the dataframe
    "sepal length (cm)", 
    "sepal width (cm)", 
    "petal length (cm)", 
    "petal width (cm)",
    "species"]
df=pd.read_csv(SOURCEDATA, names=headers) # create dataframe 'df'; names= supplies the labels because iris.data has no header row
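
Once the frame is loaded, the class balance described in section 1.2 (50 samples per species) can be checked with a per-label count. The sketch below is illustrative only: `species_counts` is a hypothetical helper, and `demo` is a stand-in frame containing just the species labels repeated 50 times each, not the real measurements.

```python
import pandas as pd

def species_counts(frame, column="species"):
    """Return the number of rows per label in `column`, sorted by label."""
    return frame[column].value_counts().sort_index()

# Stand-in frame (labels only), built to mirror the structure described in section 1.2
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
demo = pd.DataFrame({"species": [label for label in labels for _ in range(50)]})

counts = species_counts(demo)
print(counts)
```

In the notebook itself the same helper could be called as `species_counts(df)` on the dataframe created above.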

In [19]:
# ---- Data Validation -------------
# check for anomalies and dataframe structure
# ref: https://www.tutorialspoint.com/exploratory-data-analysis-on-iris-dataset
def fn_datavalidation(): 
    print("\n-------> dataframe structure: \n")
    print(df.head(3)) # first 3 lines of data
    print("\n-------> dataframe info: \n")
    print(df.info()) #outputs column names, count of non-null values and datatypes; info() prints directly and returns None, so print() also emits "None"
    print("\n-------> Checking for Null entries: \n")
    print(df.isnull().sum()) #outputs the number of null entries in each column
    print("\n")
#-------------------------end of function
fn_datavalidation()


-------> dataframe structure: 

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   

       species  
0  Iris-setosa  
1  Iris-setosa  
2  Iris-setosa  

-------> dataframe info: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   species            150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None

-------> Checking for Null entries: 

sepal length (cm) 