# First look at the Iris dataset

To begin the analysis and start this project, I assume no prior knowledge of this dataset (_although I have encountered it in other coursework modules this semester_).

The CSV file containing the dataset can be downloaded from the UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/dataset/53/iris

In [None]:
#Load the pandas module for the DataFrame
import pandas as pd 

#load local csv file located in .\resources\iris.data (this file has no header row)
iris = pd.read_csv("resources/iris.data", header=None)

Alternatively, you can load this dataset directly within Python by importing the **datasets** submodule from the **scikit-learn** module  
https://scikit-learn.org/1.4/auto_examples/datasets/plot_iris_dataset.html

In [None]:
#Import the datasets submodule (within the sklearn module)
from sklearn import datasets  

iris = datasets.load_iris()

With that we can begin to look at the dataset structure ...  
***

### Collections: Lists, Arrays, Datasets, Dataframes, and Bunches (???)

With the wide array of "container" data types available in Python (and extended modules), it is important to identify what types we are dealing with when first opening a dataset.  

If you choose to load the dataset from a comma-separated values (CSV) file with **pandas.read_csv()**, it will return a two-dimensional data structure from the **pandas** module known as a "DataFrame"   
https://pandas.pydata.org/docs/reference/frame.html  
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Alternatively, if you choose the **sklearn.datasets.load_iris()** approach, it loads the iris dataset and returns, by default, a dictionary-like "Bunch" object  
https://scikit-learn.org/0.24/modules/generated/sklearn.utils.Bunch.html

> *By default, sklearn.datasets.load_iris() returns a Bunch object holding NumPy arrays, but you can get pandas DataFrames instead with the parameter as_frame=True*
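
As a minimal sketch of the as_frame=True parameter mentioned above (assuming scikit-learn 0.23 or later, where I believe this parameter was introduced), the Bunch then carries pandas objects. The variable names here are my own and only for illustration.

```python
from sklearn import datasets

# Ask load_iris() for pandas objects instead of NumPy arrays
iris_frame_bunch = datasets.load_iris(as_frame=True)

# .frame holds the 4 feature columns plus a "target" column
iris_full_df = iris_frame_bunch.frame

print(type(iris_frame_bunch.data))    # features as a DataFrame
print(type(iris_frame_bunch.target))  # labels as a Series
print(iris_full_df.shape)
```

The return value is still a Bunch, so the other attributes (feature_names, target_names, DESCR) remain available alongside the DataFrame.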

This Bunch object contains multiple attributes:

* The data is a **numpy.ndarray** (2D array for features).
* The target is a **numpy.ndarray** (1D array for class labels).
* The feature_names is a **list** of strings, while target_names is a **numpy.ndarray** of strings.
* The DESCR is a **str** (detailed description of the dataset).

These data attributes are further explored in more detail in [this notebook](Bunch_attributes.ipynb) 
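
A quick way to confirm what each attribute holds is to ask Python directly. The sketch below assumes the default load_iris() call (no as_frame) and relies on the Bunch being dictionary-like, so keys and attributes can be used interchangeably.

```python
from sklearn import datasets

iris = datasets.load_iris()

# A Bunch behaves like a dict: attributes and keys are interchangeable
print(list(iris.keys()))
print(iris["data"] is iris.data)

# Report the Python type behind each of the attributes discussed above
for name in ["data", "target", "feature_names", "target_names", "DESCR"]:
    print(f"{name}: {type(iris[name]).__name__}")
```

The exact list of keys may differ between scikit-learn versions, so I have not shown an expected output here.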


Below are examples of conversions between the various datatypes:

In [9]:
#load the Bunch object
iris = datasets.load_iris()

# Convert Bunch to DataFrame with the data attribute (numpy.ndarray)
iris_data_frame = pd.DataFrame(iris.data, columns=iris.feature_names)

#Convert Dataframe to numpy ndarray
iris_arr = iris_data_frame.to_numpy()

# Convert the data (numpy ndarray) to a list
iris_data_list = iris.data.tolist()

print(f"iris type is {type(iris)}")
print(f"iris.data type is {type(iris.data)}")
print(f"iris_arr type is {type(iris_arr)}")
print(f"iris_data_frame type is {type(iris_data_frame)}")
print(f"iris_data_list type is {type(iris_data_list)}")



iris type is <class 'sklearn.utils._bunch.Bunch'>
iris.data type is <class 'numpy.ndarray'>
iris_arr type is <class 'numpy.ndarray'>
iris_data_frame type is <class 'pandas.core.frame.DataFrame'>
iris_data_list type is <class 'list'>


***
### Shape of the Iris dataset

Both **pandas.DataFrame** and **numpy.ndarray** expose a "shape" attribute.

If reviewing the shape of the data returned from a **pandas read_csv()** operation, then the shape is the tuple (150, 5), i.e. 150 rows by 5 columns

In [30]:
iris_csv = pd.read_csv("resources/iris.csv")
print(iris_csv.shape)

(150, 5)


> *Note: I experienced an issue with another file downloaded from the UCI Machine Learning Repository.   
The shape was only (149, 5) because the first data row was consumed as the header. That file was a "data" file rather than a "csv" file and did not have the feature names in a first header row, unlike the csv file downloaded from GitHub:*  

https://gist.github.com/netj/8836201#file-iris-csv
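
The header issue can be reproduced without any file on disk. The sketch below uses three sample rows in the headerless layout of the UCI "data" file, read from an in-memory string. The column names passed to names= are hypothetical labels of my own choosing, not names defined by the file.

```python
import io
import pandas as pd

# Three sample rows in the headerless layout of the UCI "iris.data" file
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# Default behaviour: the first row is treated as the header and lost as data
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every row; names= supplies column labels (hypothetical names)
column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
with_names = pd.read_csv(io.StringIO(raw), header=None, names=column_names)

print(with_default.shape)  # one row short
print(with_names.shape)    # all rows kept
```

The same header=None and names= arguments should apply when reading the real headerless file from disk.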

The Bunch object returned from **sklearn.datasets.load_iris()** contains several attributes besides the core "data".  
Using the numpy.ndarray shape attribute, we can extract the size of the Iris dataset.  
We can also look at the "target" data (also a numpy.ndarray)

The "data" attribute is a 2D array of 150 x 4 floating point data points (150 data items with 4 attributes)  
The "target" attribute is a 1D array of 150 integer class codes 0, 1, or 2 (one label per data item)

In [16]:
#view the shape of data and target attributes
print(iris.data.shape)
print(iris.target.shape)

(150, 4)
(150,)


To get further context on what the 4 data attributes are ...   
The Bunch object has a "feature_names" attribute with 4 entries that name the dataset's features.  

While **feature_names** lists the names of the dataset's features, the **target_names** lists the names of the classes (species) enumerated in the dataset's target


In [None]:
#view the raw data in feature_names and target_names attributes
print(iris.feature_names)
print(iris.target_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']
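
To see how the target codes relate to target_names, the sketch below counts the samples per class code and then uses the codes as indices into target_names. It assumes the default load_iris() call, and the variable names are illustrative only.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Count how many samples carry each integer class code
codes, counts = np.unique(iris.target, return_counts=True)
print(codes, counts)

# Use the codes as indices into target_names to get a species name per sample
species = iris.target_names[iris.target]
print(species[:3], species[-3:])
```

Indexing target_names with the whole target array works because target_names is a NumPy array, so one species name comes back for each of the 150 samples.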


Finally, we have the DESCR attribute, a long string that contains a full detailed description of the iris dataset.   
The size of the DESCR string is 1065 characters

In [20]:
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

***
### Null Values

We should test this dataset for any null values ...  
While DataFrames support an isnull() method, NumPy arrays do not.
The null check below is therefore run on the DataFrame loaded from the CSV file.

In [43]:
#load the iris dataframe
iris_df = pd.read_csv("resources/iris.csv")

# Check for missing values and count them per column
null_values = iris_df.isnull().sum()

print(f"Number of Null Values :\n{null_values}")

Number of Null Values :
sepal.length    0
sepal.width     0
petal.length    0
petal.width     0
variety         0
dtype: int64
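
For the NumPy arrays inside the Bunch, a comparable check is possible even though ndarray has no isnull() method. This is a minimal sketch, assuming the default load_iris() call; it uses numpy.isnan() and the top-level pandas.isnull() function, both of which accept a float array.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# np.isnan() flags NaN entries element-wise in a float array
nan_count = int(np.isnan(iris.data).sum())

# pd.isnull() also accepts NumPy arrays and returns a boolean array
null_count = int(pd.isnull(iris.data).sum())

print(nan_count, null_count)
```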


***
# First Look Conclusion
The iris dataset can be loaded using the **pandas.read_csv()** or **sklearn.datasets.load_iris()** method.   

The iris dataset returned by **sklearn.datasets.load_iris()** is not a simple array, list, or dictionary.  
It contains the following attributes:   
   - `data` → **`numpy.ndarray`** (2D array with features)
   - `target` → **`numpy.ndarray`** (1D array with target labels)
   - `feature_names` → **`list`** (list of feature names)
   - `target_names` → **`numpy.ndarray`** (array of target class names)
   - `DESCR` → **`str`** (string description)

The core data, collected by Edgar Anderson and published by Ronald Fisher in 1936, is contained in the 'data' ndarray.  
It is a 2D array of 150 x 4 data points with no null values, *i.e. 4 specific features were measured on 150 plants:*
1. sepal length (cm)
2. sepal width (cm)
3. petal length (cm)
4. petal width (cm)   

These 4 features are visualized in the image below of a typical Iris flower:   
![alt text](images/sepals_and_petals_600w.webp)

The names of these 4 features are listed in the feature_names attribute


The species of each measured plant is stored in the target attribute.  
Based on the measurements taken on these four features, statistical or machine learning methods can distinguish the 3 species.
Each of the 150 measured plants belongs to one of 3 distinct classes or species:
1. setosa
2. versicolor
3. virginica

The target attribute array stores a number (0-2) where each number represents the class or species name

**Any processing on the core 'data' or 'target' arrays needs both feature_names and target_names to give the data contextual meaning.**
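
One way to attach that context is to combine all four attributes into a single labelled DataFrame. This is only a sketch of one possible approach, assuming the default load_iris() call; the "species" column name and the labelled_df variable are my own choices.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Label the feature columns with feature_names
labelled_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Turn the integer codes in target into named categories via target_names
labelled_df["species"] = pd.Categorical.from_codes(
    iris.target, categories=iris.target_names
)

print(labelled_df.head())
print(labelled_df["species"].value_counts())
```

Using a categorical column keeps the integer codes and the species names together, so later grouping or plotting can refer to species by name.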