# Fundamentals of Data Analysis Project

**Linda Grealish**

***

# Table of Contents
1. [Introduction](#overview)  
    - [Problem Statement](#problem-statement)
    - [About this notebook and technologies used](#notebook)  
    - [About the Iris dataset](#iris-dataset)  <br/><br/>
2. [Classification of the variables](#classification)
    - [Loading dataset and python libraries](#load-dataset-libraries)
     

<a id="overview"></a>

# 1. Introduction and Project Overview: 

This notebook contains my submission for the Fundamentals of Data Analysis module (2023) at ATU, part of the Higher Diploma in Computing and Data Analytics. The topic of the project is the research and investigation of Fisher's Iris dataset.

<a id="problem-statement"></a>
## Problem statement:

*• The project is to create a notebook investigating the variables and data points within the well-known iris flower data set associated with Ronald A Fisher.*

*• In the notebook, you should discuss the classification of each variable within the data set according to common variable types and scales of measurement in mathematics, statistics, and Python.*

*• Select, demonstrate, and explain the most appropriate summary statistics to describe each variable.*

*• Select, demonstrate, and explain the most appropriate plot(s) for each variable.*

*• The notebook should follow a cohesive narrative about the data set.*

<a id="notebook"></a>
## About this notebook and Python libraries used

This project was developed mainly in Python using the following packages:

- NumPy is one of the most important packages for numerical and scientific computing in Python. Its `numpy.random` subpackage is used for working with random numbers.
- Seaborn is a Python data visualization library for making attractive and informative statistical graphics.
- Pandas provides data analysis tools and is designed for working with tabular data: an ordered collection of columns in which each column can have a different value type.
- Matplotlib is a plotting library for Python; Seaborn is built on top of it.


<a id="iris-dataset"></a>
## The Iris dataset 

The Iris flower data, also known as Fisher's Iris dataset, was introduced by the British biologist and statistician Sir Ronald Aylmer Fisher. In 1936, Fisher published a paper titled [“The Use of Multiple Measurements in Taxonomic Problems”](https://onlinelibrary.wiley.com/doi/epdf/10.1111/j.1469-1809.1936.tb02137.x) in the journal Annals of Eugenics. In this article, Fisher developed and evaluated a linear function to differentiate Iris species based on the morphology of their flowers. It was the first time that the sepal and petal measurements of the three Iris species (setosa, versicolor and virginica) appeared publicly.

The *iris* dataset is available in the [seaborn-data repository](https://github.com/mwaskom/seaborn-data) belonging to Michael Waskom, the creator of the [seaborn](https://seaborn.pydata.org/index.html) Python data visualisation package.

The *iris* dataset illustrates the “tidy” approach to organizing a dataset. [Tidy data](https://en.wikipedia.org/wiki/Tidy_data) is an alternate name for the common statistical form called a model matrix or data matrix, which is defined as follows:

>A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual.

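The data-matrix layout described in the quotation above can be sketched with a small hand-built DataFrame. The three rows below are copied from the first rows of the iris data as printed later in this notebook; the variable names `tidy`, `i`, `j` and `value` are illustrative only.

```python
import pandas as pd

# First three rows of the iris data, as printed later in this notebook.
tidy = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7],
        "sepal_width": [3.5, 3.0, 3.2],
        "petal_length": [1.4, 1.4, 1.3],
        "petal_width": [0.2, 0.2, 0.2],
        "species": ["setosa", "setosa", "setosa"],
    }
)

# Each row is one flower (an individual); each column is one variable.
n_individuals, n_variables = tidy.shape
print(f"{n_individuals} individuals x {n_variables} variables")

# Entry in row i, column j: the value of variable j observed on individual i.
i, j = 1, 2  # zero-based: second flower, third variable (petal_length)
value = tidy.iloc[i, j]
print(f"{tidy.columns[j]} of flower {i}: {value}")
```

Positional indexing with `iloc` mirrors the "ith row, jth column" wording of the definition, although pandas counts from zero.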

It is a multivariate data set containing 50 samples from each of three species of Iris: setosa, versicolor and virginica. Four properties of each flower were measured: sepal length, sepal width, petal length and petal width. Fisher suggests that these petal and sepal measurements can be used to identify which species a flower belongs to, based on a linear discriminant model. Fisher himself developed the linear discriminant model, a statistical, machine learning and pattern recognition technique used to distinguish between two or more objects, classes or events (sources: the Wikipedia articles on linear discriminant analysis and the Iris data set). He presented the data for the three species in a table with each of the four measurements and, subsequently, tables of observed means, sums of squares and so on, in order to demonstrate how the species can be discriminated from one another.
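
To make the idea of a linear discriminant more concrete, here is a minimal sketch of the two-class form of the technique using NumPy. It is not a reproduction of Fisher's analysis: it assumes only two measurements (sepal length and petal length) and only the ten setosa and virginica rows that `df.head()` and `df.tail()` print later in this notebook. The helper `scatter` and all variable names are illustrative.

```python
import numpy as np

# Columns: sepal_length, petal_length. Rows taken from the head() and tail()
# output shown later in this notebook (five setosa, five virginica).
setosa = np.array([[5.1, 1.4], [4.9, 1.4], [4.7, 1.3], [4.6, 1.5], [5.0, 1.4]])
virginica = np.array([[6.7, 5.2], [6.3, 5.0], [6.5, 5.2], [6.2, 5.4], [5.9, 5.1]])


def scatter(x):
    """Sums of squares and products of deviations from the class mean."""
    centred = x - x.mean(axis=0)
    return centred.T @ centred


# Within-class scatter matrix and the difference between the class means.
s_within = scatter(setosa) + scatter(virginica)
mean_diff = virginica.mean(axis=0) - setosa.mean(axis=0)

# Discriminant direction w: solve s_within @ w = mean_diff.
w = np.linalg.solve(s_within, mean_diff)

# Projecting each flower onto w reduces it to a single score per flower.
setosa_scores = setosa @ w
virginica_scores = virginica @ w
print("setosa scores:   ", setosa_scores.round(2))
print("virginica scores:", virginica_scores.round(2))
```

On these ten rows every virginica score is larger than every setosa score, so a single threshold on the projected value would separate the two groups. With the full data set and all four measurements the same calculation applies, but the separation between versicolor and virginica is not guaranteed to be this clean.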


<img src="https://github.com/lgrealish/pands-project/blob/main/iris-species-image.png" alt="Iris flower species">

<a id="classification"></a>

# 2. Classification of the variables within the dataset

<a id="load-dataset-libraries"></a>
## Loading dataset and python libraries

In [3]:
# Import modules necessary for this task
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# Read in the iris dataset from the online source and create a dataframe named df.
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'

df = pd.read_csv(url)

Now that the CSV file has been successfully read into a pandas DataFrame, I want to check that the DataFrame looks OK. The pandas `df.head()` and `df.tail()` methods are a useful way to check that a CSV file has been read in properly. The `tail()` method is particularly useful because any problems usually appear towards the end of the DataFrame and throw off the last few rows, but all looks well here.

In [10]:
# Look at the top 5 rows of the DataFrame df
print("The first few rows in the dataset: \n\n", df.head()) 

# Look at the bottom 5 rows of the DataFrame
print('\n The final few rows in the dataset \n',df.tail()) 

The first few rows in the dataset: 

    sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

 The final few rows in the dataset 
      sepal_length  sepal_width  petal_length  petal_width    species
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica
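
To illustrate why `tail()` is worth checking, the sketch below reads a small, made-up CSV string that ends with a stray footer line. The CSV text and the names `csv_text`, `sample` and `footer_detected` are hypothetical; the real iris file has no such footer.

```python
import io

import pandas as pd

# Hypothetical CSV text: three valid rows followed by a stray footer line.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
Downloaded from an example site
"""

sample = pd.read_csv(io.StringIO(csv_text))

# The footer is parsed as a data row; its missing fields become NaN.
last_row = sample.tail(1)
print(last_row)

# A simple check: does the final row contain any missing values?
footer_detected = bool(last_row.isna().any(axis=1).iloc[0])
print("Problem in last row:", footer_detected)
```

A footer like this also forces the first column to hold text rather than numbers, which is another reason to inspect the data types after loading.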


In [12]:
# Summarise the data types contained in the df
print("The dtypes in the dataframe are:", end='\n\n')

print(df.dtypes) 

The dtypes in the dataframe are:

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
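
The `species` column is stored with the generic `object` dtype because it holds strings. Since species is a nominal (unordered) categorical variable, one option is to convert it to the pandas `category` dtype. The sketch below uses a small illustrative Series rather than the full DataFrame, and the conversion is a suggestion rather than a step this notebook performs.

```python
import pandas as pd

# A few illustrative species labels (not the full column).
species = pd.Series(
    ["setosa", "versicolor", "virginica", "setosa"], name="species"
)

# Convert the string labels to the pandas categorical dtype.
as_category = species.astype("category")

print(as_category.dtype)                  # the dtype is now 'category'
print(list(as_category.cat.categories))   # the distinct labels
print(as_category.cat.ordered)            # False: the labels have no ranking
```

On the full DataFrame the equivalent call would be `df['species'].astype('category')`.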


In [7]:
# Checking for any null values
df.isnull()

     sepal_length  sepal_width  petal_length  petal_width  species
0           False        False         False        False    False
1           False        False         False        False    False
2           False        False         False        False    False
3           False        False         False        False    False
4           False        False         False        False    False
..            ...          ...           ...          ...      ...
145         False        False         False        False    False
146         False        False         False        False    False
147         False        False         False        False    False
148         False        False         False        False    False


In [19]:
# Check whether any column contains a null value (one result per column)
print(*df.isna().any()) 

False False False False False


From the above code we can see that:

- there is 1 non-numerical object column and there are 4 numerical columns

- there are no null values
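
The same two conclusions can also be checked programmatically. The sketch below uses `select_dtypes` and `isna` on a three-row sample copied from the `head()` output above; the names `sample`, `numeric_cols`, `other_cols` and `total_missing` are illustrative.

```python
import pandas as pd

# Three rows copied from the head() output above.
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7],
        "sepal_width": [3.5, 3.0, 3.2],
        "petal_length": [1.4, 1.4, 1.3],
        "petal_width": [0.2, 0.2, 0.2],
        "species": ["setosa", "setosa", "setosa"],
    }
)

# Split the column names by type: numeric versus everything else.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Count missing values across the whole frame.
total_missing = int(sample.isna().sum().sum())

print("Numeric columns:    ", numeric_cols)
print("Non-numeric columns:", other_cols)
print("Missing values:     ", total_missing)
```

Replacing `sample` with `df` would run the same checks on the full data set.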

In [18]:
df.describe(include='object')


        species
count       150
unique        3
top      setosa
freq         50
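
The `top` and `freq` rows report only one most frequent value, which can hide ties. Because the data set holds 50 flowers of each species, all three species share the same frequency. The sketch below shows this with `value_counts` on a Series of labels built to match that 50/50/50 split; the Series is constructed for illustration rather than read from the file.

```python
import numpy as np
import pandas as pd

# Illustrative labels: 50 of each species, matching the counts described above.
species = pd.Series(
    np.repeat(["setosa", "versicolor", "virginica"], 50), name="species"
)

# Frequency of every label, not just the most common one.
counts = species.value_counts()
print(counts)

# describe() reports a single 'top' value even though all three are tied.
summary = species.describe()
print(summary)
```

On the full DataFrame the equivalent call would be `df['species'].value_counts()`.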


In [20]:
df.describe()

       sepal_length  sepal_width  petal_length  petal_width
count         150.0        150.0         150.0        150.0
mean       5.843333     3.057333         3.758     1.199333
std        0.828066     0.435866      1.765298     0.762238
min             4.3          2.0           1.0          0.1
25%             5.1          2.8           1.6          0.3
50%             5.8          3.0          4.35          1.3
75%             6.4          3.3           5.1          1.8
max             7.9          4.4           6.9          2.5
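
Two details of this summary are easy to miss: the `std` row is the sample standard deviation (divisor n - 1), and the `50%` row is the median. The sketch below checks both on the five setosa sepal lengths printed by `head()` earlier; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# The five sepal_length values from the head() output earlier.
values = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0], name="sepal_length")
summary = values.describe()

# pandas uses the sample standard deviation (ddof=1) by default.
sample_std = np.std(values.to_numpy(), ddof=1)
population_std = np.std(values.to_numpy(), ddof=0)

print("describe() std:         ", summary["std"])
print("NumPy std with ddof=1:  ", sample_std)
print("NumPy std with ddof=0:  ", population_std)

# The 50% row is the median of the column.
print("describe() 50%:", summary["50%"], "| median:", values.median())
```

This matters when comparing results with NumPy, whose `np.std` defaults to the population form (`ddof=0`) and therefore gives a slightly smaller value.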


## Sources

**Classification**

- https://www.pycodemates.com/2022/05/iris-dataset-classification-with-python.html?utm_content=cmp-true
- https://towardsdatascience.com/classification-basics-walk-through-with-the-iris-data-set-d46b0331bf82
- https://rasbt.github.io/mlxtend/user_guide/data/iris_data/
- https://www.geeksforgeeks.org/python-basics-of-pandas-using-iris-dataset/
- https://www.kaggle.com/code/sixteenpython/machine-learning-with-iris-dataset
- https://medium.com/@estebanthi/classification-analysis-use-case-the-iris-dataset-99b3902b708b
- https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html
- https://rpubs.com/shahworld/penguin



***

## End