## Fundamentals of Data Analysis - Project ##

**Name: James McEneaney** <br/><br/>
**Course: Higher Diploma in Computing in Data Analytics, ATU Ireland** <br/><br/> 
**Semester: Semester 2 2023** <br/><br/>


***

## Introduction ##

This project sets out to analyse the data points and variables within Fisher's Iris Dataset. For each variable, the focus will be on classifying the variable according to its type as represented in Python code, using appropriate summary statistics to analyse it, and displaying it using the most appropriate plots.

I am using Visual Studio Code (VS Code) (version 1.83.1) to write my project and to upload it to my repository on GitHub for assessment. I am also using Jupyter code cells within VS Code.

This file will firstly give some background information to the dataset in question. 

I will then outline the steps which I needed to take before I could begin the actual analysis itself: downloading the dataset, preparing the dataset by adding the correct headings, and importing the modules, libraries and packages which I used to complete the project.

Next, I will outline some summary statistics relating to the variables within the dataset. In addition to the summary statistics generated by the .describe() method, I calculated, for the dataset as a whole, figures for the coefficient of variation and skewness. I also calculated the correlation coefficients for the six possible pairs of traits. I have included a description of these statistics, along with my interpretation of the results, in the 'Summary of each variable' section. My script will write these summary statistics for each of the four variables (i.e. traits) to a text file.

I will then carry out data visualisation on the Iris dataset, using histograms and scatterplots. My script will save each plot generated from the data as a .png file and these can be viewed below. I will discuss my interpretations of the histograms and scatterplots.

My project will conclude with an overall summary of my findings and my thoughts upon conclusion of the project. I will also provide a list of references (using APA reference style) which I used to complete my work.


### Background of dataset ###

The data set was collected in 1935 by the American botanist Edgar Anderson and used in 1936 by the British statistician and biologist Ronald A. Fisher. It relates to data collected from samples of three species of the Iris flowering plant genus: Iris setosa, Iris virginica, and Iris versicolor. \
It is commonly used as an introductory data set by people who are learning how to analyse and visualise data using programming languages.

Fifty samples were collected for each species, giving one hundred and fifty samples in total. \
For each sample, four features of the flower were measured; these were: sepal length, sepal width, petal length and petal width. These attributes of the samples are contained in columns 1, 2, 3 and 4 respectively within the dataset. The species name of the flower is also included in the dataset in column 5.

The petals of a flowering plant are modified leaves which surround the reproductive parts of the flower, and which are often brightly coloured to attract pollinators. Sepals usually protect the flower when it is in bud and structurally support the petals when the flower is in bloom.

Below is an image of the three flowers analysed in the dataset, with the sepal and petal labelled on one of the three species (Iris versicolor):

![image](https://raw.githubusercontent.com/jmce22/pands-project/main/iris_flowers.png)

### Pre-analysis ###

I downloaded the Iris dataset from https://archive.ics.uci.edu/ml/datasets/iris. I saved it as a .csv file in the folder where I worked on my project. There were no headings for the data when I opened the dataset; however, the site from which I downloaded the data did explain what each of the five columns in the dataset represents. I used this information to add headings to the data so that I could manipulate it.

The headings for the five columns were given as below:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
   * Iris setosa
   * Iris versicolour
   * Iris virginica

To enable me to analyse the dataset, I will import some libraries and modules commonly used for this purpose:

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

These libraries are as follows:

* *Matplotlib* is a Python library used to make plots and graphs. It requires NumPy to run. *Matplotlib.pyplot* is a collection of functions which allows us to work with the plots we make, for example to create them, add headings or change the colour scheme.

* *NumPy* (Numerical Python) is a package used in Python to carry out mathematical operations on numerical datatypes, such as integers and floating-point numbers. It creates multi-dimensional array objects which allow Python to carry out mathematical operations much more efficiently than would be the case in Python without NumPy. 

* *pandas* is built on top of NumPy and is a powerful and flexible Python package used for data analysis, especially of tabular data, such as the data in the .csv file used for this project. I used pandas to open the Iris dataset. pandas creates data structures which allow data to be manipulated, the most important being the 1-dimensional 'Series' and the 2-dimensional 'DataFrame' (the structure which is used here to manipulate the Iris data). A DataFrame stores data as a two-dimensional structure where each piece of information has a row label and a column label.

* *Seaborn* is built on top of matplotlib. It enables us to make more appealing plots, utilising different styles.

Next, I will create a pandas DataFrame object to use in my analysis. I will do this by reading the .csv file with pandas and assigning names to each column, based on the information available on the website I downloaded the .csv file from:

In [13]:
df = pd.read_csv("iris.data.csv", names=['sepal length', 'sepal width', 'petal length', 'petal width', 'variety']) 

We can get a sense of what information is contained within the dataset by printing out the first five and last five rows from it:

In [4]:
print(df)

     sepal length  sepal width  petal length  petal width         variety
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

[150 rows x 5 columns]


It will also be useful to create sub-dataframes from the overall dataframe, one for each of the three Iris species:

In [14]:
df_setosa = df[df['variety'] == 'Iris-setosa']
df_versicolor = df[df['variety'] == 'Iris-versicolor']
df_virginica = df[df['variety'] == 'Iris-virginica']
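
The same split can also be written with `groupby`. The following is only a minimal sketch, not the method used in this project: `sample_df` and `species_frames` are hypothetical names, and `sample_df` holds just the ten rows shown in the printout of `df` above, so Versicolor does not appear in it.

```python
import pandas as pd

# Assumption: a ten-row sample copied from the printout of df above,
# used here so that the sketch runs without the .csv file.
columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'variety']
sample_rows = [
    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
    [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
    [4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
    [5.0, 3.6, 1.4, 0.2, 'Iris-setosa'],
    [6.7, 3.0, 5.2, 2.3, 'Iris-virginica'],
    [6.3, 2.5, 5.0, 1.9, 'Iris-virginica'],
    [6.5, 3.0, 5.2, 2.0, 'Iris-virginica'],
    [6.2, 3.4, 5.4, 2.3, 'Iris-virginica'],
    [5.9, 3.0, 5.1, 1.8, 'Iris-virginica'],
]
sample_df = pd.DataFrame(sample_rows, columns=columns)

# Build one sub-dataframe per variety in a single pass,
# keyed by the variety name found in the data.
species_frames = {name: group for name, group in sample_df.groupby('variety')}

for name, frame in species_frames.items():
    print(name, frame.shape)
```

With the full dataset, `species_frames['Iris-setosa']` should hold the same rows as `df_setosa`. The dictionary avoids typing each species name by hand, at the cost of slightly less explicit variable names.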

### Variable types in data set ###

Fisher's Iris dataset consists of five variables: sepal length, sepal width, petal length, petal width and the variety of the Iris flower. To determine what "type" pandas assigns to each variable, we can use the DataFrame attribute df.dtypes:

In [5]:
print(df.dtypes)

sepal length    float64
sepal width     float64
petal length    float64
petal width     float64
variety          object
dtype: object


We find that each of the four numerical variables is classified as a floating-point number. Additionally, we know that each of these four variables is a *ratio* variable as opposed to an *interval* variable; for a ratio variable, a value of zero corresponds to an absence of that variable, while for an interval variable, a value of zero can be just a point on the scale of measurement. The fifth variable, variety, is stored as the generic object type, which pandas uses for text; it is a categorical (nominal) variable.
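
As an illustration of this classification in code, the sketch below separates the numerical columns from the variety column and converts the latter to the pandas `category` type. It is a minimal sketch under assumptions: `sample_df`, `numeric_columns` and `variety_as_category` are hypothetical names, and `sample_df` contains only the ten rows shown in the printout of `df` earlier.

```python
import pandas as pd

# Assumption: a ten-row sample copied from the printout of df earlier.
columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'variety']
sample_rows = [
    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
    [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
    [4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
    [5.0, 3.6, 1.4, 0.2, 'Iris-setosa'],
    [6.7, 3.0, 5.2, 2.3, 'Iris-virginica'],
    [6.3, 2.5, 5.0, 1.9, 'Iris-virginica'],
    [6.5, 3.0, 5.2, 2.0, 'Iris-virginica'],
    [6.2, 3.4, 5.4, 2.3, 'Iris-virginica'],
    [5.9, 3.0, 5.1, 1.8, 'Iris-virginica'],
]
sample_df = pd.DataFrame(sample_rows, columns=columns)

# The four measured traits are the numerical (ratio) variables.
numeric_columns = sample_df.select_dtypes(include='number').columns.tolist()

# The variety column is nominal, so it can be stored as a categorical type.
variety_as_category = sample_df['variety'].astype('category')

print(numeric_columns)
print(variety_as_category.cat.categories.tolist())
```

Converting to `category` is optional here; it mainly makes the nominal nature of the variable explicit.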

### Summary statistics ###

In [6]:
summary_stats = df.describe()
transpose_summary = summary_stats.T      # transposing puts one trait per row, which is easier to read
summary = transpose_summary
print(summary)

              count      mean       std  min  25%   50%  75%  max
sepal length  150.0  5.843333  0.828066  4.3  5.1  5.80  6.4  7.9
sepal width   150.0  3.054000  0.433594  2.0  2.8  3.00  3.3  4.4
petal length  150.0  3.758667  1.764420  1.0  1.6  4.35  5.1  6.9
petal width   150.0  1.198667  0.763161  0.1  0.3  1.30  1.8  2.5
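
The introduction also mentions skewness and the correlation coefficients for the six possible pairs of traits. One way these could be obtained with pandas is sketched here; this is not necessarily how the project script does it. The names `sample_df`, `skewness`, `correlations` and `trait_pairs` are hypothetical, and `sample_df` contains only the ten rows shown in the printout of `df` earlier, so the numbers it produces are not those of the full dataset.

```python
from itertools import combinations

import pandas as pd

# Assumption: a ten-row sample copied from the printout of df earlier.
columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'variety']
sample_rows = [
    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
    [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
    [4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
    [5.0, 3.6, 1.4, 0.2, 'Iris-setosa'],
    [6.7, 3.0, 5.2, 2.3, 'Iris-virginica'],
    [6.3, 2.5, 5.0, 1.9, 'Iris-virginica'],
    [6.5, 3.0, 5.2, 2.0, 'Iris-virginica'],
    [6.2, 3.4, 5.4, 2.3, 'Iris-virginica'],
    [5.9, 3.0, 5.1, 1.8, 'Iris-virginica'],
]
sample_df = pd.DataFrame(sample_rows, columns=columns)

# Keep only the four measured traits.
numeric = sample_df.select_dtypes(include='number')

# Skewness of each trait.
skewness = numeric.skew()
print(skewness.round(2))

# Pearson correlation matrix (the pandas default), then the six unique pairs.
correlations = numeric.corr()
trait_pairs = list(combinations(numeric.columns, 2))
for first, second in trait_pairs:
    print(f'{first} vs {second}: {correlations.loc[first, second]:.2f}')
```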


I also want to measure a statistic called the coefficient of variation: this measures the relative dispersion of data points in a data series around the mean and is calculated by dividing the standard deviation by the mean. In the variable names below it is abbreviated to "cov"; it should not be confused with covariance, which is a different statistic. First we calculate the standard deviation and mean for each variable, and then we use these figures to calculate the coefficient of variation:

In [21]:
meansl = df["sepal length"].mean()
sdsl = df["sepal length"].std()
meansw = df["sepal width"].mean()
sdsw = df["sepal width"].std()
meanpl = df["petal length"].mean()
sdpl = df["petal length"].std()
meanpw = df["petal width"].mean()
sdpw = df["petal width"].std()

cov_sl = (sdsl/meansl)
cov_sw = (sdsw/meansw)
cov_pl = (sdpl/meanpl)
cov_pw = (sdpw/meanpw)

# rounding to 2 decimal places
round_cov_sl = round(cov_sl, 2)
round_cov_sw = round(cov_sw, 2)
round_cov_pl = round(cov_pl, 2)
round_cov_pw = round(cov_pw, 2)

print(f'Coefficient of variation of sepal length = {round_cov_sl}')
print(f'Coefficient of variation of sepal width = {round_cov_sw}')
print(f'Coefficient of variation of petal length = {round_cov_pl}')
print(f'Coefficient of variation of petal width = {round_cov_pw}')

Coefficient of variation of sepal length = 0.14
Coefficient of variation of sepal width = 0.14
Coefficient of variation of petal length = 0.47
Coefficient of variation of petal width = 0.64
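
As a check on the arithmetic, the four figures can be recomputed directly from the mean and std columns of the summary table printed earlier. This is a minimal sketch: the name `summary_table` is hypothetical, and the values are typed in from that table rather than read from the .csv file.

```python
import pandas as pd

# Mean and standard deviation of each trait, copied from the
# summary statistics table printed earlier.
summary_table = pd.DataFrame(
    {
        'mean': [5.843333, 3.054000, 3.758667, 1.198667],
        'std': [0.828066, 0.433594, 1.764420, 0.763161],
    },
    index=['sepal length', 'sepal width', 'petal length', 'petal width'],
)

# Coefficient of variation = standard deviation / mean, for all traits at once.
summary_table['cv'] = (summary_table['std'] / summary_table['mean']).round(2)
print(summary_table['cv'])

# With the full DataFrame loaded, the equivalent one-liner would be:
# df.std(numeric_only=True) / df.mean(numeric_only=True)
```

The result matches the rounded figures printed above: 0.14, 0.14, 0.47 and 0.64.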


The above figures relate to the dataset as a whole. To investigate the coefficient of variation of each of the four traits for each species, we can use the sub-dataframes created earlier:

*Coefficient of variation of traits in Setosa:*

In [18]:
meansl_setosa = df_setosa["sepal length"].mean()
sdsl_setosa = df_setosa["sepal length"].std()
meansw_setosa= df_setosa["sepal width"].mean()
sdsw_setosa = df_setosa["sepal width"].std()
meanpl_setosa = df_setosa["petal length"].mean()
sdpl_setosa = df_setosa["petal length"].std()
meanpw_setosa = df_setosa["petal width"].mean()
sdpw_setosa = df_setosa["petal width"].std()

cov_sl_setosa = (sdsl_setosa/meansl_setosa)
cov_sw_setosa= (sdsw_setosa/meansw_setosa)
cov_pl_setosa= (sdpl_setosa/meanpl_setosa)
cov_pw_setosa = (sdpw_setosa/meanpw_setosa)

# rounding to 2 decimal places
round_cov_sl_setosa = round(cov_sl_setosa, 2)
round_cov_sw_setosa = round(cov_sw_setosa, 2)
round_cov_pl_setosa = round(cov_pl_setosa, 2)
round_cov_pw_setosa = round(cov_pw_setosa, 2)

print(f'Coefficient of variation of sepal length for Setosa = {round_cov_sl_setosa}')
print(f'Coefficient of variation of sepal width for Setosa = {round_cov_sw_setosa}')
print(f'Coefficient of variation of petal length for Setosa = {round_cov_pl_setosa}')
print(f'Coefficient of variation of petal width for Setosa = {round_cov_pw_setosa}')

Coefficient of variation of sepal length for Setosa = 0.07
Coefficient of variation of sepal width for Setosa = 0.11
Coefficient of variation of petal length for Setosa = 0.12
Coefficient of variation of petal width for Setosa = 0.44


*Coefficient of variation of traits in Versicolor:*

In [19]:
meansl_versicolor = df_versicolor["sepal length"].mean()
sdsl_versicolor = df_versicolor["sepal length"].std()
meansw_versicolor= df_versicolor["sepal width"].mean()
sdsw_versicolor = df_versicolor["sepal width"].std()
meanpl_versicolor = df_versicolor["petal length"].mean()
sdpl_versicolor = df_versicolor["petal length"].std()
meanpw_versicolor = df_versicolor["petal width"].mean()
sdpw_versicolor = df_versicolor["petal width"].std()

cov_sl_versicolor = (sdsl_versicolor/meansl_versicolor)
cov_sw_versicolor = (sdsw_versicolor/meansw_versicolor)
cov_pl_versicolor = (sdpl_versicolor/meanpl_versicolor)
cov_pw_versicolor = (sdpw_versicolor/meanpw_versicolor)

# rounding to 2 decimal places
round_cov_sl_versicolor = round(cov_sl_versicolor, 2)
round_cov_sw_versicolor = round(cov_sw_versicolor, 2)
round_cov_pl_versicolor = round(cov_pl_versicolor, 2)
round_cov_pw_versicolor = round(cov_pw_versicolor, 2)

print(f'Coefficient of variation of sepal length for Versicolor = {round_cov_sl_versicolor}')
print(f'Coefficient of variation of sepal width for Versicolor = {round_cov_sw_versicolor}')
print(f'Coefficient of variation of petal length for Versicolor = {round_cov_pl_versicolor}')
print(f'Coefficient of variation of petal width for Versicolor = {round_cov_pw_versicolor}')

Coefficient of variation of sepal length for Versicolor = 0.09
Coefficient of variation of sepal width for Versicolor = 0.11
Coefficient of variation of petal length for Versicolor = 0.11
Coefficient of variation of petal width for Versicolor = 0.15


*Coefficient of variation of traits in Virginica:*

In [20]:
meansl_virginica = df_virginica["sepal length"].mean()
sdsl_virginica = df_virginica["sepal length"].std()
meansw_virginica= df_virginica["sepal width"].mean()
sdsw_virginica = df_virginica["sepal width"].std()
meanpl_virginica = df_virginica["petal length"].mean()
sdpl_virginica = df_virginica["petal length"].std()
meanpw_virginica = df_virginica["petal width"].mean()
sdpw_virginica = df_virginica["petal width"].std()

cov_sl_virginica = (sdsl_virginica/meansl_virginica)
cov_sw_virginica = (sdsw_virginica/meansw_virginica)
cov_pl_virginica = (sdpl_virginica/meanpl_virginica)
cov_pw_virginica = (sdpw_virginica/meanpw_virginica)

# rounding to 2 decimal places
round_cov_sl_virginica = round(cov_sl_virginica, 2)
round_cov_sw_virginica = round(cov_sw_virginica, 2)
round_cov_pl_virginica = round(cov_pl_virginica, 2)
round_cov_pw_virginica = round(cov_pw_virginica, 2)

print(f'Coefficient of variation of sepal length for Virginica = {round_cov_sl_virginica}')
print(f'Coefficient of variation of sepal width for Virginica = {round_cov_sw_virginica}')
print(f'Coefficient of variation of petal length for Virginica = {round_cov_pl_virginica}')
print(f'Coefficient of variation of petal width for Virginica = {round_cov_pw_virginica}')

Coefficient of variation of sepal length for Virginica = 0.1
Coefficient of variation of sepal width for Virginica = 0.11
Coefficient of variation of petal length for Virginica = 0.1
Coefficient of variation of petal width for Virginica = 0.14
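
The three per-species calculations above repeat the same steps, so they could also be expressed as a single `groupby` operation. The following is a minimal sketch rather than the code used in this project: `sample_df`, `grouped` and `cv_by_species` are hypothetical names, and `sample_df` holds only the ten rows shown in the printout of `df` earlier (five Setosa, five Virginica), so its output differs from the figures above.

```python
import pandas as pd

# Assumption: a ten-row sample copied from the printout of df earlier.
columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'variety']
sample_rows = [
    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
    [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
    [4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
    [5.0, 3.6, 1.4, 0.2, 'Iris-setosa'],
    [6.7, 3.0, 5.2, 2.3, 'Iris-virginica'],
    [6.3, 2.5, 5.0, 1.9, 'Iris-virginica'],
    [6.5, 3.0, 5.2, 2.0, 'Iris-virginica'],
    [6.2, 3.4, 5.4, 2.3, 'Iris-virginica'],
    [5.9, 3.0, 5.1, 1.8, 'Iris-virginica'],
]
sample_df = pd.DataFrame(sample_rows, columns=columns)

# Group the rows by variety, then divide the per-group standard
# deviation by the per-group mean for every trait at once.
grouped = sample_df.groupby('variety')
cv_by_species = grouped.std() / grouped.mean()

# One row per variety, one column per trait.
print(cv_by_species.round(2))
```

Applied to the full `df`, the same two lines would give one row for each of the three species.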


### Histograms ###

In [None]:
sns.set_theme(style="white")

sepal_length = sns.displot(df, x ="sepal length", bins = 20, hue ="variety", palette = "Set1_r", multiple = "stack").set(title = "Sepal length for each Iris variety", xlabel = "Sepal length in cm")
plt.tight_layout()
plt.savefig('sepal_length_hist.png')

sepal_width = sns.displot(df, x ="sepal width", bins = 20, hue ="variety", palette = "Set1_r", multiple = "stack").set(title = "Sepal width for each Iris variety", xlabel = "Sepal width in cm")
plt.tight_layout()
plt.savefig('sepal_width_hist.png')

petal_length = sns.displot(df, x ="petal length", bins = 20, hue ="variety", palette = "Set1_r", multiple = "stack").set(title = "Petal length for each Iris variety", xlabel = "Petal length in cm")
plt.tight_layout()
plt.savefig('petal_length_hist.png')

petal_width = sns.displot(df, x ="petal width", bins = 20, hue ="variety", palette = "Set1_r", multiple = "stack").set(title = "Petal width for each Iris variety", xlabel = "Petal width in cm")
plt.tight_layout()
plt.savefig('petal_width_hist.png')

### Scatterplots ###

### References ###