# Task 2: Explore the Data Structure

These tasks are based on the instructions given in https://github.com/ianmcloughlin/principles_of_data_analytics/blob/main/assessment/tasks.md

<font color='sky blue'>"Print and explain the shape of the data set, the first and last 5 rows of the data, the feature names, and the target classes."</font>

In [30]:
#import required modules 
import pandas as pd
import numpy as np
from sklearn import datasets as ds

#load the Iris Dataset
iris = ds.load_iris()



Once the Iris dataset is loaded, we can apply a series of *sklearn dataset* operations to explore it.

In [2]:
#break down the content of the dataset and its structure
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

We can view the contents of each of these "keys" and work out what each one means.

In [3]:
#the raw data
iris["data"]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

The 'data' component is a 2D array of floating-point numbers.
The array size is 150x4.

Each of the 150 rows records the measurements sampled from one flower.
Each of the 4 columns holds the measurement of one specific characteristic of the Iris flower.
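
The size and element type described above can be checked directly rather than read off the printed array. This is a minimal sketch that reloads the dataset so it runs on its own; the variable name `data` is only for illustration.

```python
from sklearn import datasets as ds

iris = ds.load_iris()
data = iris["data"]

# Number of dimensions, (rows, columns) and element type of the array
print(data.ndim)   # 2
print(data.shape)  # (150, 4)
print(data.dtype)  # float64
```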

To see which characteristics of the Iris flowers are measured, we can use the 'feature_names' key of the dataset.

In [4]:
#Extract the characteristics measured in dataset
iris['feature_names']


['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

The keys 'target' and 'target_names' can be used to see how each Iris flower is classified.
The 'target' component is an array of 150 integers, one per measured flower (int32 in this run; the exact integer width can vary by platform). Each value is an index into the 'target_names' collection.

In [None]:
#review target and target_names classifications of dataset
print(iris['target_names'])
print(iris['target'])


['setosa' 'versicolor' 'virginica']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
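
Because each 'target' value is an index into 'target_names', we can translate the codes into class names and count the flowers per class. The sketch below is one possible way to do this with NumPy indexing; the variable names `species`, `codes` and `counts` are illustrative only.

```python
import numpy as np
from sklearn import datasets as ds

iris = ds.load_iris()

# Use each integer code in 'target' as an index into 'target_names'
species = iris["target_names"][iris["target"]]
print(species[0], species[50], species[100])  # setosa versicolor virginica

# Count how many flowers fall into each class (50 per class)
codes, counts = np.unique(iris["target"], return_counts=True)
for code, count in zip(codes, counts):
    print(code, iris["target_names"][code], count)
```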


By converting the dataset into a pandas *DataFrame*, we can read the value of its "shape" attribute.
This "shape" attribute tells us that the DataFrame has 150 rows and 5 columns: the 4 measurements plus the added 'Species' column.

In [16]:
# Convert to DataFrame
dframe = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add the target (species)
dframe['Species'] = iris.target

# Get the shape of the dataset
print(dframe.shape)

(150, 5)
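
The 'Species' column holds the integer codes from 'target', not the class names. As a hedged sketch, the codes could be translated into names with `pd.Categorical.from_codes`; the extra column name `Species name` is a hypothetical choice for this example and is not used elsewhere in the notebook.

```python
import pandas as pd
from sklearn import datasets as ds

iris = ds.load_iris()
dframe = pd.DataFrame(iris.data, columns=iris.feature_names)
dframe["Species"] = iris.target

# Hypothetical extra column: translate each integer code into its class name
dframe["Species name"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(dframe.shape)  # (150, 6)
print(dframe["Species name"].value_counts())
```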


Converting to a pandas DataFrame also lets us list subsets of the data,<br>
for example the first or last n rows.

In [28]:
#display first 5 rows (the number of rows defaults to 5)
dframe.head()

#display first 10 rows
dframe.head(10)

#display last 10 rows
#note: a notebook cell only renders its final expression, so only this tail(10) output is shown
dframe.tail(10)


     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  Species
140                6.7               3.1                5.6               2.4        2
141                6.9               3.1                5.1               2.3        2
142                5.8               2.7                5.1               1.9        2
143                6.8               3.2                5.9               2.3        2
144                6.7               3.3                5.7               2.5        2
145                6.7               3.0                5.2               2.3        2
146                6.3               2.5                5.0               1.9        2
147                6.5               3.0                5.2               2.0        2
148                6.2               3.4                5.4               2.3        2
149                5.9               3.0                5.1               1.8        2
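
The task asks for the first and last 5 rows. A notebook cell only renders the value of its last expression, so one way to see both in a single cell is to wrap each call in `print()`. This is a minimal sketch that rebuilds `dframe` so it runs on its own.

```python
import pandas as pd
from sklearn import datasets as ds

iris = ds.load_iris()
dframe = pd.DataFrame(iris.data, columns=iris.feature_names)
dframe["Species"] = iris.target

# print() forces output for every call, not just the last expression in the cell
print(dframe.head())  # rows 0 to 4
print(dframe.tail())  # rows 145 to 149
```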


## <font color='crimson'>I am curious as to why the 'target' component is needed.</font>
Why does the Iris dataset have a separate 'data' array of 150x4 and an additional 'target' array of 150 values?<br>
Why not just a single 150x5 array?
<br><br>
According to ChatGPT:
#### Why Separate data and target?
There are a few reasons for this separation:

1. Clarity and Convention:
In many machine learning tasks, it is standard to separate the features (inputs) and the target (output/labels) into different components. This is because:
Features and targets often serve different purposes in model training. The features are what the model uses to make predictions, while the target is the ground truth the model aims to predict.
Keeping them separate makes it clear which data is being used as input and which data is being used as output.

2. Flexibility:
The separate arrays allow for greater flexibility. You can mix and match the data with different target labels or perform tasks like cross-validation more easily.
You might want to modify the target labels, handle them differently, or train the model on only part of the dataset without affecting the features.

3. Compatibility with Machine Learning Libraries:
Most machine learning libraries, including Scikit-learn, expect the features and target to be separate. For instance, when using Scikit-learn's model training methods, you usually pass the features (X) and the target (y) separately, for example `model.fit(X, y)`.

4. Maintainability and Modularity:
The separation of data and target allows the dataset to be more modular. For example, you can have multiple target variables in multi-output problems (e.g., a dataset with multiple target columns). Keeping the input data separate from target data provides cleaner, more maintainable code.

5. Possible Multi-Target Datasets:
In some cases, the dataset may have more than one target (e.g., predicting multiple variables). By keeping the features and targets separate, it becomes easier to add more target variables if needed.
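
To illustrate the point about passing features and target separately, here is a minimal sketch of fitting a Scikit-learn estimator on the Iris data. The choice of `KNeighborsClassifier` with `n_neighbors=3` is an assumption made only for this example; any Scikit-learn classifier takes `X` and `y` in the same way.

```python
from sklearn import datasets as ds
from sklearn.neighbors import KNeighborsClassifier

iris = ds.load_iris()
X = iris.data    # features, shape (150, 4)
y = iris.target  # labels, shape (150,)

# The estimator receives the inputs and the labels as two separate arguments
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# Predict the class of the first flower and translate the code into a name
predicted = model.predict(X[:1])
print(iris.target_names[predicted[0]])
```

Keeping `X` and `y` apart means the 'data' and 'target' arrays can be handed to `fit` exactly as they come out of `load_iris()`.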


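To convert the **Iris dataset into a NumPy array**, you can access the data and target attributes of the dataset and combine them into a single array if needed.<br>
The DataFrame *dframe* above cannot be indexed positionally with *dframe[0][2]*: *dframe[0]* looks for a column labelled 0, which does not exist, so pandas raises a KeyError. Converting the DataFrame to a NumPy array makes positional indexing possible.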


In [None]:
#Convert DataFrame to a numpy array
numpy_arr = dframe.to_numpy()

#read row 0, column 2 (petal length of the first flower)
numpy_arr[0, 2]

1.4
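
Two alternatives follow from the discussion above, shown here as a hedged sketch: combining 'data' and 'target' directly into a single 150x5 array with `np.column_stack`, and reading a cell by position from the DataFrame with `.iloc` without converting it. The variable name `combined` is illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn import datasets as ds

iris = ds.load_iris()

# Stack the 150x4 'data' array and the 150-element 'target' array side by side
combined = np.column_stack((iris.data, iris.target))
print(combined.shape)  # (150, 5)
print(combined[0, 2])  # 1.4

# The same cell read positionally from a DataFrame, without converting it
dframe = pd.DataFrame(iris.data, columns=iris.feature_names)
dframe["Species"] = iris.target
print(dframe.iloc[0, 2])  # 1.4
```

Note that the combined array is entirely floating point, so the integer class codes are stored as 0.0, 1.0 and 2.0.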