# Fundamentals of Data Analysis Project

Jaime Lara Carrillo

***

This Jupyter notebook contains the project for the Fundamentals of Data Analysis module of ATU Galway's HDip in Data Analysis.

# Iris Fisher Data Set

## Index

1. [Description](#description)
2. [Project stages and organisation](#project-stages-and-organisation)  
3. [Used tools and libraries](#used-tools-and-libraries)  
4. [Data collection](#data-collection)  
5. [Data analysis](#data-analysis)  
      5.1 [Import and cleaning data](#import-and-cleaning-data)  

***

## Description

This project analyses and investigates a multivariate dataset introduced
by Ronald Fisher in his 1936 paper, _The use of multiple measurements in taxonomic problems_.
Although published by Fisher, the data was originally collected by the American botanist Edgar Anderson, which is why it is sometimes called Anderson's Iris data set.  
For two of the three species, the samples were collected on the same day and in the same area.

This famous dataset is often treated as the ABC of machine learning and data analysis, because it is small, clean and easy to explore.  

The Iris dataset covers three species of Iris flowers with 50 samples of each species, giving a total of 150 records described by five attributes:
1. Sepal length (cm)
2. Sepal width (cm)
3. Petal length (cm)
4. Petal width (cm)
5. Class

The features of this sample are shown in the following image:

![image](https://user-images.githubusercontent.com/110190460/234045122-186ab79b-4fbc-4065-ac3e-017c0e1b97ee.png)

The species are: Setosa, Versicolor and Virginica.  

![image](https://user-images.githubusercontent.com/110190460/234044887-5cf5d38c-8ac7-4846-98d3-bbc213e6f32a.png)

This makes it a multivariate data set, which means that two or more variables are recorded for each observation.
As can be seen, despite having a similar shade of colour, these flowers differ in the length and width of their petals and sepals.
In this project the measurements are stored in a .csv file, a plain-text format that spreadsheet programs such as Excel can open.

***

## Project stages and organisation

The project can be divided into the following tasks: 
* Show which technologies and tools will be used
* Obtain and download the dataset
* Import the dataset into the integrated development environment (IDE) 
* Review the dataset for incorrect or incomplete data
* Carry out different statistical analyses of the data
* Show the results of the analysis graphically
* Draw a conclusion from the research carried out
* List the references used in preparing the project


**The project consists of the following files**:  

This notebook file contains the entire project, its development and analysis.  
The images and the database can be found in the corresponding folders.


___

## Used tools and libraries


**The project uses the following technologies**:

* Python version 3.9.2 as the programming language
* A .csv file, viewable in an Excel spreadsheet, for the imported data
* Visual Studio Code as the integrated development environment (IDE)
* Jupyter Notebook as the document that contains the whole project

**The project uses the following libraries**:

* NumPy  
A library for numerical computation that lets you work with large, multi-dimensional arrays and matrices of numerical data. It provides a wide range of mathematical functions that operate on these arrays, making it an essential tool for scientific computing, data analysis, and machine learning. To find out more: https://numpy.org/  

  
* Matplotlib  
A plotting library for the Python programming language that provides a wide variety of 2D and 3D graphs and plots. To find out more: https://matplotlib.org/  


* Seaborn  
A data visualization library for Python built on top of Matplotlib.  
It provides a high-level interface for creating informative statistical graphics.  
To find out more: https://seaborn.pydata.org/  
 
 
* Pandas  
A fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language.  
To find out more: https://pandas.pydata.org/  
 

The complete set of imports is therefore as follows:  
```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
```

___

## Data collection

To obtain the dataset we can go to the following link:  
https://archive.ics.uci.edu/dataset/53/iris  

However, the file is in .txt format and has no header, so a number of transformations have to be made to it.  
To simplify things, and given the huge popularity of this dataset, the data can also be found with a header  
and in other formats, such as csv. An example is at the following link:  
https://datahub.io/machine-learning/iris#resource-iris  
Before importing the data set, I renamed the columns in snake case to give a clearer name to each characteristic of the flowers.  
Once downloaded, the iris flower database is saved in the database folder (files\database\iris_csv.csv).  
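
If you start from the headerless UCI file instead, the missing header can be supplied while reading it. The following is a minimal sketch, not the method used in this project: the three rows in `raw_text` are a small hand-typed stand-in for the real file, and the names `raw_text`, `column_names` and `sample_df` are chosen here for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for the headerless file: a few comma-separated rows.
raw_text = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# Snake-case column names, matching the ones used in the rest of this notebook.
column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# header=None tells pandas that the first line is data, not column names;
# names= supplies the header that the file lacks.
sample_df = pd.read_csv(io.StringIO(raw_text), header=None, names=column_names)

print(sample_df)
```

With a real file, the `io.StringIO(raw_text)` argument would be replaced by the path to the downloaded file.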

***

## Data analysis

#### Import and cleaning data
Now that the data has been downloaded, it is important to clean it, removing records that may distort  
the result of our analysis, i.e. corrupted data, empty cells, duplicates and incorrect data.  
First of all, the first code cell imports the libraries mentioned above to make their functionality available.  

In [1]:
```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
```

Once the libraries are imported, we read the file named "iris_csv" into a pandas DataFrame named "df" to access the information it contains.  

In [2]:
```python
# Forward slashes avoid invalid escape sequences such as "\d" and work on every OS.
df = pd.read_csv("files/database/iris_csv.csv")
```

Before working with the DataFrame, the data has to be cleaned, as the following problems may arise:
1. Null values
2. Corrupt or atypical values
3. Repeated rows
4. Incompatible values (for example, text in a numeric column)

Data cleaning is a crucial step in data analysis because it improves the quality and reliability of the data. 
It is important for the following reasons:
1. Reliability of results: Clean data ensures that analyses and models are based on accurate and consistent information, leading to more reliable conclusions.
2. Improved accuracy: By eliminating erroneous or inconsistent data, accuracy metrics in models and analyses are improved.
3. Facilitates visualization: Clean data facilitates the creation of clear and understandable visualizations, which helps identify patterns and trends effectively.
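
As an illustration of the "incompatible values" problem listed above, the sketch below converts a text column to numbers with `pd.to_numeric`. It is only an example under assumptions: the Series `raw_lengths` and its contents are invented for the demonstration and are not part of the Iris file.

```python
import pandas as pd

# Hypothetical raw column in which one measurement was typed as text.
raw_lengths = pd.Series(["5.1", "4.9", "unknown", "6.3"], name="sepal_length")

# errors="coerce" converts what it can and turns everything else into NaN,
# so the bad entries can be counted or removed afterwards.
clean_lengths = pd.to_numeric(raw_lengths, errors="coerce")

print(clean_lengths)
print(clean_lengths.isna().sum())
```

Coercing to NaN rather than raising an error is a design choice: it keeps the row positions intact, so the problematic records can be inspected before deciding what to do with them.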


A quick way to get general information about the structure and content of the DataFrame is to use the .info() method, which provides an informative summary.

In [3]:
```python
# info() prints its summary directly and returns None, so it is not wrapped in print().
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
```


The output shows details such as the number of rows, the number of columns, the data type of each column and how much memory is being used.  

As for missing values, every column reports 150 non-null entries out of 150 rows, so the DataFrame is complete.
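
The non-null counts can also be checked directly with `isna()`. The sketch below uses a small invented frame, `toy_df`, with one value deliberately left missing so that the result is not all zeros; on the Iris DataFrame the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value.
toy_df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, np.nan],
        "sepal_width": [3.5, 3.0, 3.2],
    }
)

# isna() marks missing cells as True; sum() counts them per column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# A single yes/no answer: is any cell missing anywhere in the frame?
has_missing = bool(toy_df.isna().any().any())
print(has_missing)
```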

You can also check which rows are repeated with the .duplicated() method:

In [4]:
```python
duplicates = df.duplicated()
print(duplicates)
```

```
0      False
1      False
2      False
3      False
4      False
       ...  
145    False
146    False
147    False
148    False
149    False
Length: 150, dtype: bool
```


The result only shows the first and last rows, so the .sum() method is used to count the number of True values (duplicated rows) in that Series.

In [5]:
```python
duplicate_count = duplicates.sum()
print(f"The number of duplicate rows is: {duplicate_count}")
```

```
The number of duplicate rows is: 3
```


Since the data describes physical characteristics recorded to one decimal place, it is plausible for two flowers of the same species to have identical measurements, so these duplicates are not necessarily errors.
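
To judge whether duplicates like these are plausible, it helps to look at the rows themselves. The following sketch assumes a small invented frame, `toy_df`, in which two rows are identical; it shows one possible way to list every member of a duplicated group and, if one decided to remove them, how `drop_duplicates()` would behave.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are identical.
toy_df = pd.DataFrame(
    {
        "petal_length": [1.4, 1.5, 1.4, 4.7],
        "petal_width": [0.2, 0.2, 0.2, 1.4],
        "class": ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor"],
    }
)

# keep=False flags every member of a duplicated group, not only the later copies.
all_duplicates = toy_df[toy_df.duplicated(keep=False)]
print(all_duplicates)

# drop_duplicates() returns a new frame; the original is left untouched.
deduplicated = toy_df.drop_duplicates()
print(len(toy_df), len(deduplicated))
```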

To look for atypical values, the .describe() method returns descriptive statistics, such as the mean, standard deviation, minimum, maximum and percentiles, which can hint at the presence of outliers.

In [6]:
```python
statistics_summary = df.describe()
print(statistics_summary)
```

```
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
```
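
The quartiles reported by .describe() can be turned into an explicit outlier check. One common convention, used here only as an illustration and not necessarily the criterion this project adopts, is to flag values lying more than 1.5 times the interquartile range (IQR) outside the first or third quartile. The Series `widths` below is invented, with 9.0 planted as an obvious outlier.

```python
import pandas as pd

# Hypothetical toy measurements; 9.0 is planted as an obvious outlier.
widths = pd.Series([2.8, 3.0, 3.0, 3.1, 3.2, 3.3, 9.0], name="sepal_width")

# First and third quartiles, i.e. the 25% and 75% rows of describe().
q1 = widths.quantile(0.25)
q3 = widths.quantile(0.75)
iqr = q3 - q1

# Values beyond these fences are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = widths[(widths < lower_fence) | (widths > upper_fence)]
print(outliers)
```

A flagged value is only a candidate for review: whether it is a measurement error or a genuinely unusual flower has to be decided from the context.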
