# Iris Dataset Project
##### Iam Student, #123456789
##### Monday 1 May 2023

## Introduction

The Iris, a type of flowering plant, was extensively studied by botanist Edgar Anderson in the Gaspé Peninsula, where he recorded various characteristics of numerous iris flowers. The dataset gained recognition after biologist and statistician Ronald Fisher used it in his 1936 paper, *The Use of Multiple Measurements in Taxonomic Problems*, to illustrate how statistics can be applied to classification tasks. He suggested that attributes that differ among the species in the dataset, particularly sepal and petal measurements, could serve as reliable indicators for identifying different iris groups. For these reasons the Iris dataset is popular within the Data Science and Machine Learning communities, where it is often likened to the "Hello World" program of data science.

The objective of this report is to explore the hypothesis that this dataset can be utilized for species classification, and to elucidate the reasons for its popularity among the Data Science and Machine Learning communities.

## Methodology

### Data Collection

Data for this project can be obtained from various sources. In this instance the data was downloaded from [2] and then loaded into a pandas dataframe for use in the later stages.



In [None]:
!wget -O iris.csv https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data > /dev/null 2>&1

### Data Preparation

This step involves cleaning the data where required: removing errors and duplicates and dealing with null values. For the Iris dataset this is not necessary, but the pandas function `isnull()` was used to verify that the dataset was complete.
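
As a minimal sketch of that completeness check, the snippet below counts missing cells per column and exact duplicate rows. The three inline rows and the name `check_df` are illustrative assumptions so the example runs without the downloaded file; they are not part of the original notebook.

```python
import io
import pandas as pd

# Three illustrative rows in the same layout as iris.csv (no header row).
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species']
check_df = pd.read_csv(sample, names=columns)

# isnull() flags each missing cell; sum() counts the flags per column.
null_counts = check_df.isnull().sum()
# duplicated() marks rows that exactly repeat an earlier row.
duplicate_rows = int(check_df.duplicated().sum())

print(null_counts)
print('Duplicate rows:', duplicate_rows)
```

The same two calls can be applied to the full dataframe once it has been loaded.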

### Data Exploration

Exploratory Data Analysis allows us to better understand the data through statistical and visual analysis in order to form hypotheses and uncover potential patterns in the data [3].

For the exploratory portion of the project, the pandas Python library was used. The visualisations were created with a mixture of *matplotlib*, *seaborn*, and *pandas*. The initial analysis and testing of the code were developed in the **exploration_development.ipynb** notebook.





## Analysis 
*Pandas* shows that the dataset comprises 150 rows and 5 columns: Sepal Length, Sepal Width, Petal Length, Petal Width, and Species. The four measurement columns are float datatypes, and the Species column is an object datatype containing the species names. There are no null values that need to be accounted for in later analysis, and each species has 50 samples.




In [None]:
import pandas as pd
columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species']
df = pd.read_csv('iris.csv', names=columns)
print('Table 1: Species Counts')
df["Species"].value_counts()

Below is a sample of 10 entries in the Iris Dataset:



In [None]:
print('Table 2: Random Sample of 10 values from the Iris Dataset')
df.sample(10)

The *describe* function in *pandas* shows some basic statistics such as means, standard deviations and medians.




In [None]:
print('Table 3: Statistical Summary of the Iris Dataset')
df.describe()

## Visualisations
It may be beneficial at this stage to visualise the data and the relationships between measurements to see if any patterns emerge. For this I used a *seaborn* pair plot, which pairs every feature with every other feature and colours the points by species. A pair plot outputs a mixture of two-dimensional scatter plots, which show the relationships between measurements, and univariate histograms, which show the distribution of each measurement separated by species.



In [None]:
import seaborn as sb
# Pair Plot
print('Figure 1: Pairwise plot of Iris Dataset')
sb.pairplot(df, hue='Species')



To see how each measurement is distributed by species, a box plot demonstrates the differences in distribution between petal and sepal values. Petal measurements occupy a much smaller range per species than sepal measurements, which are more spread out. The wider spread of sepal measurements means there are more outliers and thus more chances for the data to overlap between species, making classification based on sepal measurements more difficult. This, along with the density of the petal data, provides further support for the hypothesis from the previous section that the distinctions between species are likely petal-based rather than sepal-based.
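
One possible way to draw such a box plot is sketched below. It assumes the wide-format columns used elsewhere in this notebook; the helper name `plot_measurement_boxplots` and the six inline rows are hypothetical and exist only so the sketch runs on its own.

```python
import pandas as pd
import seaborn as sb

def plot_measurement_boxplots(frame):
    """Draw one box per measurement and species from a wide-format frame."""
    # Reshape so that each row holds a single measurement value and its species.
    long_frame = frame.melt(id_vars='Species', var_name='Measurement', value_name='Value')
    axis = sb.boxplot(data=long_frame, x='Measurement', y='Value', hue='Species')
    return long_frame, axis

# Illustrative rows only; in the notebook the full dataframe would be passed instead.
sample = pd.DataFrame(
    [
        (5.1, 3.5, 1.4, 0.2, 'Iris-setosa'),
        (4.9, 3.0, 1.4, 0.2, 'Iris-setosa'),
        (7.0, 3.2, 4.7, 1.4, 'Iris-versicolor'),
        (6.4, 3.2, 4.5, 1.5, 'Iris-versicolor'),
        (6.3, 3.3, 6.0, 2.5, 'Iris-virginica'),
        (5.8, 2.7, 5.1, 1.9, 'Iris-virginica'),
    ],
    columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species'],
)
long_sample, axis = plot_measurement_boxplots(sample)
```

Melting to long format first lets a single call place all four measurements on one axis, with species distinguished by colour.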


### Correlations
It appears that a distinction exists between petal and sepal measurements, so perhaps there is some internal consistency within them. The *pandas* `corr()` function tells us more about this:



In [None]:
print('Table 4: Correlation Matrix of Iris Dataset')
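# numeric_only=True skips the text 'Species' column, which recent pandas versions would otherwise reject
df.corr(numeric_only=True)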


The correlations can be visualised more easily on the heatmap below. The large number of red and orange squares shows that the majority of measurements are highly correlated with one another, except for sepal width.

In [None]:
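# Correlate only the numeric measurement columns, then annotate each cell with its r value
sb.heatmap(df.corr(numeric_only=True), annot=True)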

## Discussion

From Table 3 we can see the following:
* Mean values show that sepal and petal lengths are larger than their respective width measurements and that sepals run larger than petals. So, if we had never seen an iris flower before, we can assume from these figures that petals are smaller than sepals and that both are generally longer than they are wide.
* The mean and median values of both sepal measurements are quite similar and the low standard deviations [4] suggest this data may be quite clustered together with datapoints relatively close to the mean. 
* On the other hand, the mean and median values of both petal measurements are not as close and petal standard deviations are slightly higher than their sepal counterparts, suggesting that there is more variance in the petal data.
* The previous two observations suggest that any differences between iris species may be more likely related to petal features rather than sepal.

From Figure 1 we see that iris setosa (blue) appears visually to be quite separate from the other two species in virtually all scatter plot feature combinations, but most drastically in petal measurements. While there is quite a bit of observable overlap between versicolor (orange) and virginica (green), particularly in terms of sepal measurements, setosa appears to be clearly linearly separable. Petal length, petal width and sepal length in setosa are considerably smaller than those of either versicolor or virginica, as we can see in the histograms, where the setosa data sits much closer to the left of each graph.

Also, although versicolor and virginica are not cleanly distinct from one another, again the petal measurements demonstrate a pattern with virginica tending to have longer and wider petals than versicolor. Sepal measurements for these species are much more clustered when examined alone but when paired with petal measurements, distinctions can be seen as those with larger petals seemingly tend to also have larger sepals.

From Table 4 we can see that:
* Petal length and width are very highly positively correlated (r = 0.96), which tells us that as one gets larger so does the other, indicating that petal length and width have a close relationship.
* On the other hand, sepal measurements have a very weak relationship with one another (r = -0.1).
* Both petal length and width have very strong positive correlations with sepal length (r = 0.87 and 0.82 respectively) indicating that as both get larger, so does sepal length.
* Both petal length and width have fairly weak negative correlations with sepal width (r = -0.42 and -0.36 respectively) indicating there is not much of a relationship between these features. 
* These correlations tell us that sepal width is not moderately or highly correlated with any other measurement.
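
The r values above are Pearson coefficients, which is the default method used by the pandas `corr()` function. As a rough sketch of what that number measures, the coefficient can be computed by hand. The function name `pearson_r` and the toy values are hypothetical and are not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of each pair's deviations from the two means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations of each variable.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_deviation / (spread_x * spread_y)

# Hypothetical toy values, not taken from the dataset.
lengths = [1.0, 2.0, 3.0, 4.0]
widths_up = [0.5, 1.0, 1.5, 2.0]    # rises in step with lengths
widths_down = [2.0, 1.5, 1.0, 0.5]  # falls in step with lengths

r_up = pearson_r(lengths, widths_up)
r_down = pearson_r(lengths, widths_down)
print(round(r_up, 2), round(r_down, 2))
```

Values near +1 or -1 indicate a nearly linear relationship, while values near 0 indicate little linear relationship, which is how the figures in the list above are read.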





## Conclusion
The data analysis performed in this investigation uncovered many interesting details about the iris dataset but the main points of note are:
* Iris setosa is linearly distinguishable from both versicolor and virginica. 
    * Given an iris with short, narrow petals and short but wide sepals, the species could be reliably predicted to be setosa.
* Species differentiation depends more on petal measurements than on sepal measurements.
    * Versicolor and virginica are not very distinguishable from one another in terms of sepal measurements but looking at the petal data, virginica irises seem more likely to have longer, wider petals than versicolor. 

The patterns identified in the data analysis demonstrate why the Iris Dataset is a popular choice for machine learning tutorials: the separability of the species makes building a predictive model relatively easy. A machine learning program learns from data provided by previous examples.
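
To illustrate how far that separability goes, the sketch below uses a single hand-written rule on petal length to flag setosa. It is not a trained model. The cut-off of 2.5, the function name `is_setosa`, and the six inline rows are assumptions chosen for illustration rather than results from the analysis above.

```python
# Illustrative rows: (sepal length, sepal width, petal length, petal width, species).
rows = [
    (5.1, 3.5, 1.4, 0.2, 'Iris-setosa'),
    (4.9, 3.0, 1.4, 0.2, 'Iris-setosa'),
    (7.0, 3.2, 4.7, 1.4, 'Iris-versicolor'),
    (6.4, 3.2, 4.5, 1.5, 'Iris-versicolor'),
    (6.3, 3.3, 6.0, 2.5, 'Iris-virginica'),
    (5.8, 2.7, 5.1, 1.9, 'Iris-virginica'),
]

# Assumed cut-off, picked by eye for this sketch rather than learned from data.
SETOSA_MAX_PETAL_LENGTH = 2.5

def is_setosa(petal_length, threshold=SETOSA_MAX_PETAL_LENGTH):
    """Return True when the petal is short enough to be treated as setosa."""
    return petal_length < threshold

predictions = [is_setosa(row[2]) for row in rows]
actual = [row[4] == 'Iris-setosa' for row in rows]
# Fraction of rows where the rule agrees with the recorded species.
accuracy = sum(p == a for p, a in zip(predictions, actual)) / len(rows)
print('Agreement with labels on the sample rows:', accuracy)
```

A learned classifier would choose such a boundary from the training examples instead of having it written in by hand, and would need more than one feature to separate versicolor from virginica.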



## References

[1] Pandas Library, https://pandas.pydata.org/

[2] Iris Dataset, https://archive-beta.ics.uci.edu/dataset/53/iris

[3] Douieb, 2017, https://www.quora.com/What-are-the-steps-include-in-data-exploration

[4] ResearchGate, https://www.researchgate.net/post/What_do_you_consider_a_good_standard_deviation

