## Cleaning and Exploratory Data Analysis

In this notebook, we clean the iris data and then perform exploratory data analysis on it.

In [9]:
## import packages

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from IPython.display import display


In [2]:
iris = pd.read_csv('/home/jul-ian/Github/iris-practice/data/processed/iris.csv')

iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


Based on the first few rows above, we can see we are working with only four features: the length and width of both the sepals and the petals. The class column contains the target. Below is a frequency table of the target variable, which shows that we have three classes (setosa, versicolor, virginica). Of note, the classes are perfectly balanced, with 50 observations per class.

After this, we take a look at the numeric features we will use as predictors. In the description table, the count is 150 for every variable, which matches the total number of observations, so none of the variables has missing values. Looking at the distributions, nothing stands out as being problematic.

In [3]:
iris['class'].value_counts().to_frame()

Unnamed: 0,class
Iris-setosa,50
Iris-versicolor,50
Iris-virginica,50


In [4]:
iris.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5
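
The missing-value claim can also be checked directly rather than read off the `count` row. The sketch below uses a small made-up frame (`demo`, not the real data) with one deliberately missing entry. It illustrates that `isna().sum()` covers every column, while `describe()` only reports counts for numeric ones.

```python
import numpy as np
import pandas as pd

# Small stand-in frame (made-up rows, not the real data) with one missing value
demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan],
    "sepal_width": [3.5, 3.0, 3.2],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor"],
})

# Count missing entries per column, including non-numeric columns
missing_per_column = demo.isna().sum()
print(missing_per_column)

# describe() reports non-missing counts for numeric columns only,
# so a count below len(demo) signals a gap in that column
numeric_counts = demo.describe().loc["count"]
print(numeric_counts < len(demo))
```

On the notebook's data, the equivalent call would be `iris.isna().sum()`, which should show zero for every column if the reading of the table above is right.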


At this point, I think we have seen enough to be confident that the data is of passing quality. Therefore, we will split the train and test data here, so that we do not bias any models that we create moving forward. Before we do this, the target variable will be converted to integer classes so that it can be used in any ML models.

In [8]:
class_map = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

iris['target'] = iris['class'].map(class_map)
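
One thing to watch with `Series.map` and a dictionary: any label that is not a key in the dictionary becomes `NaN` instead of raising an error. The sketch below shows this on a few stand-in labels (`labels` is illustrative, with one deliberately misspelled entry) and how such labels could be listed.

```python
import pandas as pd

class_map = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

# Stand-in labels; the last one is deliberately misspelled
labels = pd.Series(["Iris-setosa", "Iris-virginica", "Iris-versicolour"])

mapped = labels.map(class_map)

# Labels absent from the dictionary come back as NaN
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)
```

For the notebook's data, a check such as `iris['target'].notna().all()` would confirm that every row was mapped. The frequency table above lists only the three expected labels, so it should pass.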

In [10]:
X = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].to_numpy()
y = iris['target'].to_numpy()

print(X.shape)
print(y.shape)

np.save('/home/jul-ian/Github/iris-practice/data/processed/X.npy', X)
np.save('/home/jul-ian/Github/iris-practice/data/processed/y.npy', y)

(150, 4)
(150,)
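
The train/test split described earlier could be done with `train_test_split`, which is already imported. This is only a minimal sketch: the 80/20 ratio, the seed, and the stratification are assumptions, not choices stated in this notebook. The arrays `X_demo` and `y_demo` are random stand-ins with the same shapes as `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same shapes as X and y above
# (random features, 50 labels per class); swap in the real arrays
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = np.repeat([0, 1, 2], 50)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # assumed hold-out fraction
    stratify=y_demo,    # keep the class balance in both subsets
    random_state=42,    # assumed seed, for reproducibility only
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test))
```

Stratifying on the target keeps the balanced classes balanced in both subsets, which matters more as the test set gets smaller.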
