## Fundamentals of Data Analysis - Project ##

**Name: James McEneaney** <br/><br/>
**Course: Higher Diploma in Computing in Data Analytics, ATU Ireland** <br/><br/> 
**Semester: Semester 2 2023** <br/><br/>


***

## Introduction ##

This project sets out to analyse the data points and variables within Fisher's Iris Dataset. For each variable, the focus will be on classifying the variable according to its type as represented in Python code, using appropriate summary statistics to analyse the variable, and displaying the variable using the most appropriate plots.

I am using Visual Studio Code (VS Code) (version 1.83.1) to write my project and to upload it to my repository on GitHub for assessment. I am also using Jupyter code cells within VS Code.

This file will first give some background information on the dataset in question.

I will then outline the steps which I needed to take before I could begin the actual analysis itself: downloading the dataset, preparing the dataset by adding the correct headings, and importing the modules, libraries and packages which I used to complete the project.

Next, I will outline some summary statistics relating to the variables within the dataset. In addition to the summary statistics generated by the .describe() method, I calculated, for the dataset as a whole, figures for the coefficient of variation and skewness. I also calculated the correlation coefficients for the six possible pairs of traits. I have included a description of these statistics, along with my interpretation of the results, in the 'Summary of each variable' section. My script will write these summary statistics for each of the four variables (i.e. traits) to a text file.

I will then carry out data visualisation on the Iris dataset, using histograms and scatterplots. My script will save each plot generated from the data as a .png file and these can be viewed below. I will discuss my interpretations of the histograms and scatterplots.

My project will conclude with an overall summary of my findings and my thoughts upon conclusion of the project. I will also provide a list of references (using APA reference style) which I used to complete my work.


### Background of dataset ###

The data set was collected in 1935 by the American botanist Edgar Anderson and used in 1936 by the British statistician and biologist Ronald A. Fisher. It relates to data collected from samples of three species of the Iris flowering plant genus: Iris setosa, Iris virginica, and Iris versicolor. \
It is commonly used as an introductory data set by people who are learning how to analyse and visualise data using programming languages.

Fifty samples were collected for each species, giving one hundred and fifty samples in total. \
For each sample, four features of the flower were measured: sepal length, sepal width, petal length and petal width. These attributes are contained in columns 1, 2, 3 and 4 respectively within the dataset. The species name of the flower is included in column 5.

The petals of a flowering plant are the modified leaves which surround the reproductive parts of the flower, and which are often brightly coloured to attract pollinators. Sepals usually protect the flower while it is in bud and structurally support the petals when the flower is in bloom.

The image below shows the three flowers analysed in the dataset, with the sepal and petal labelled on one of the three species (Iris versicolor):

![image](https://raw.githubusercontent.com/jmce22/pands-project/main/iris_flowers.png)

In [12]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [18]:
df = pd.read_csv("iris.data.csv", names=['sepal length', 'sepal width', 'petal length', 'petal width', 'variety']) 


In [17]:
print(df)

     sepal length  sepal width  petal length  petal width         variety
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

[150 rows x 5 columns]


### Variable types in data set ###

Fisher's Iris dataset consists of five variables: sepal length, sepal width, petal length, petal width and the variety of the Iris flower. To determine what "type" Python classifies each variable as, we can use the DataFrame attribute df.dtypes:

In [None]:
print(df.dtypes)

We find that each of the four numerical variables is classified as a floating point number. Additionally, we know that each of these four variables is a *ratio* variable as opposed to an *interval* variable: for a ratio variable, a value of zero corresponds to an absence of the quantity being measured, while for an interval variable, a value of zero is just another point on the scale of measurement.
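As a rough illustration of the type check described above, the sketch below separates the numeric columns from the remaining column using `select_dtypes`. The `sample` DataFrame is a stand-in built from a few of the rows printed earlier, not the full dataset, and the variable names are only illustrative.

```python
import pandas as pd

# Stand-in data: four rows copied from the printed DataFrame above (not the full dataset)
sample = pd.DataFrame({
    "sepal length": [5.1, 4.9, 6.7, 6.3],
    "sepal width": [3.5, 3.0, 3.0, 2.5],
    "petal length": [1.4, 1.4, 5.2, 5.0],
    "petal width": [0.2, 0.2, 2.3, 1.9],
    "variety": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

# Columns that pandas stores as numbers (the four measurements)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Columns that pandas does not store as numbers (the species labels)
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric columns:", numeric_cols)
print("Other columns:", other_cols)
```

Splitting the columns this way makes it easy to apply numeric summaries only to the four measurements and to treat the variety column as a categorical label.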

### Summary statistics ###

In [None]:
summary_stats = df.describe()
summary = summary_stats.T      # transposing makes the summary statistics easier to read
print(summary)
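The introduction mentions that the summary statistics are also written to a text file. A minimal sketch of one way to do this is shown below. The file name `summary_sketch.txt` and the `sample` DataFrame (six rows copied from the printed output above) are assumptions made for illustration, not the names or data used in the project script.

```python
import pandas as pd

# Stand-in data: six rows copied from the printed DataFrame above (not the full dataset)
sample = pd.DataFrame({
    "sepal length": [5.1, 4.9, 4.7, 6.7, 6.3, 6.5],
    "sepal width": [3.5, 3.0, 3.2, 3.0, 2.5, 3.0],
    "petal length": [1.4, 1.4, 1.3, 5.2, 5.0, 5.2],
    "petal width": [0.2, 0.2, 0.2, 2.3, 1.9, 2.0],
})

# Transposed summary: one row per variable, one column per statistic
summary = sample.describe().T

# Hypothetical output file name, chosen only for this sketch
output_path = "summary_sketch.txt"

with open(output_path, "w") as f:
    f.write("Summary statistics for each variable\n")
    f.write(summary.to_string())   # plain-text table, same layout as print(summary)
    f.write("\n")
```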


I also want to measure a statistic called the coefficient of variation: this measures the relative dispersion of data points in a data series around the mean and is calculated by dividing the standard deviation by the mean. Firstly we calculate the standard deviation and mean for each variable, and secondly we use these figures to calculate the coefficient of variation:

In [None]:
meansl = df["sepal length"].mean()
sdsl = df["sepal length"].std()
meansw = df["sepal width"].mean()
sdsw = df["sepal width"].std()
meanpl = df["petal length"].mean()
sdpl = df["petal length"].std()
meanpw = df["petal width"].mean()
sdpw = df["petal width"].std()

cov_sl = (sdsl/meansl)
cov_sw = (sdsw/meansw)
cov_pl = (sdpl/meanpl)
cov_pw = (sdpw/meanpw)

# rounding to 2 decimal places
round_cov_sl = round(cov_sl, 2)
round_cov_sw = round(cov_sw, 2)
round_cov_pl = round(cov_pl, 2)
round_cov_pw = round(cov_pw, 2)

print(f'Sepal length: CoV = {round_cov_sl}')
print(f'Sepal width: CoV = {round_cov_sw}')
print(f'Petal length: CoV = {round_cov_pl}')
print(f'Petal width: CoV = {round_cov_pw}')
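The calculation above can also be written in a more compact form, since pandas applies `std()` and `mean()` to every numeric column at once. The sketch below is an alternative under that assumption. It uses a small stand-in `sample` DataFrame (six rows copied from the printed output above), so the figures it prints are not the coefficients for the full dataset.

```python
import pandas as pd

# Stand-in data: six rows copied from the printed DataFrame above (not the full dataset)
sample = pd.DataFrame({
    "sepal length": [5.1, 4.9, 4.7, 6.7, 6.3, 6.5],
    "sepal width": [3.5, 3.0, 3.2, 3.0, 2.5, 3.0],
    "petal length": [1.4, 1.4, 1.3, 5.2, 5.0, 5.2],
    "petal width": [0.2, 0.2, 0.2, 2.3, 1.9, 2.0],
})

# Coefficient of variation = standard deviation / mean, computed column by column
coef_var = (sample.std() / sample.mean()).round(2)

for name, value in coef_var.items():
    print(f"{name.capitalize()}: CoV = {value}")
```

Both approaches use the sample standard deviation (pandas divides by n - 1 by default), so they should give the same results on the same data.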

In [None]:
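# the index is numeric, so select the Iris-setosa rows by filtering on the 'variety' column
setosa = df.loc[df["variety"] == "Iris-setosa"]

print(setosa)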
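The introduction also refers to skewness and to the correlation coefficients for the six possible pairs of traits. A minimal sketch of how these could be computed with pandas follows. It assumes Pearson correlation (the default for `DataFrame.corr`) and uses a stand-in `sample` DataFrame of six rows copied from the printed output above, so the values are illustrative only.

```python
from itertools import combinations

import pandas as pd

# Stand-in data: six rows copied from the printed DataFrame above (not the full dataset)
sample = pd.DataFrame({
    "sepal length": [5.1, 4.9, 4.7, 6.7, 6.3, 6.5],
    "sepal width": [3.5, 3.0, 3.2, 3.0, 2.5, 3.0],
    "petal length": [1.4, 1.4, 1.3, 5.2, 5.0, 5.2],
    "petal width": [0.2, 0.2, 0.2, 2.3, 1.9, 2.0],
})

# Skewness of each variable (positive = longer right tail, negative = longer left tail)
skewness = sample.skew().round(2)
print(skewness)

# Pearson correlation matrix for the four traits
corr = sample.corr()

# Four traits give six unordered pairs; collect one coefficient per pair
pair_corr = {
    (a, b): round(corr.loc[a, b], 2)
    for a, b in combinations(sample.columns, 2)
}

for (a, b), r in pair_corr.items():
    print(f"{a} vs {b}: r = {r}")
```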

### Histograms ###

In [None]:
sns.set_theme(style="white")

sepal_length = sns.displot(df, x ="sepal length", bins = 20, hue ="variety", palette = "Set1_r", multiple = "stack").set(title = "Sepal length for each Iris variety", xlabel = "Sepal length in cm")
plt.tight_layout()
plt.savefig('sepal_length_hist.png')

sepal_width = sns.displot(df, x ="sepal width", bins = 20, hue ="variety", palette = "Set1_r", multiple = "stack").set(title = "Sepal width for each Iris variety", xlabel = "Sepal width in cm")
plt.tight_layout()
plt.savefig('sepal_width_hist.png')

petal_length = sns.displot(df, x ="petal length", bins = 20, hue ="variety", palette = "Set1_r", multiple = "stack").set(title = "Petal length for each Iris variety", xlabel = "Petal length in cm")
plt.tight_layout()
plt.savefig('petal_length_hist.png')

petal_width = sns.displot(df, x ="petal width", bins = 20, hue ="variety", palette = "Set1_r", multiple = "stack").set(title = "Petal width for each Iris variety", xlabel = "Petal width in cm")
plt.tight_layout()
plt.savefig('petal_width_hist.png')

### Scatterplots ###
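
As a sketch of how a scatterplot for one pair of traits could be produced in the same style as the histograms, the example below plots petal length against petal width, coloured by variety. The choice of pair, the output file name `petal_length_vs_width_scatter.png` and the stand-in `sample` DataFrame (six rows copied from the printed output above) are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in data: six rows copied from the printed DataFrame above (not the full dataset)
sample = pd.DataFrame({
    "petal length": [1.4, 1.4, 1.3, 5.2, 5.0, 5.2],
    "petal width": [0.2, 0.2, 0.2, 2.3, 1.9, 2.0],
    "variety": ["Iris-setosa"] * 3 + ["Iris-virginica"] * 3,
})

sns.set_theme(style="white")

# One point per flower, coloured by variety
ax = sns.scatterplot(data=sample, x="petal length", y="petal width",
                     hue="variety", palette="Set1_r")
ax.set(title="Petal length vs petal width for each Iris variety",
       xlabel="Petal length in cm", ylabel="Petal width in cm")

plt.tight_layout()
plt.savefig("petal_length_vs_width_scatter.png")  # hypothetical file name
```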

***

### End ###