# Exploratory Data Analysis of Iris Data


### Import necessary libraries

In [1]:
import pandas as pd
import numpy as np
import train_test as tt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

### Read data

In [2]:
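# The raw file has no header row, so column names are supplied here:
# sepal length, sepal width, petal length, petal width, and the species label.
col_names = ["sp_len", "sp_wid", "pt_len", "pt_wid", "class"]
iris_data = pd.read_csv("../../../data/iris.data", names=col_names)
iris_data.head()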

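|   | sp_len | sp_wid | pt_len | pt_wid | class |
|---|--------|--------|--------|--------|-------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |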


### Preprocessing

#### Split Target

The target is the variable we are trying to predict, here the `class` column. We separate it from the feature columns so the model is trained on the features alone.

In [3]:
iris_data.shape[0]

150

I used LabelEncoder() to convert the string class labels into integer codes.

In [4]:
# target
y = iris_data["class"]
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(y)
y.head()

0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: class, dtype: object
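Note that `y.head()` still shows the original strings; the integer codes live in `encoded_labels`. As a small illustration of what `LabelEncoder` does, the sketch below encodes a hand-written list of labels (an assumption for the example, not the notebook's data) and maps the codes back to names.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; the notebook encodes the full "class" column.
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)        # classes are sorted, then numbered from 0
class_names = list(encoder.classes_)         # index in this list == integer code
decoded = list(encoder.inverse_transform(codes))

print(codes)        # [0 1 2 0]
print(class_names)  # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
```

Keeping the fitted encoder around makes it possible to turn predicted codes back into species names with `inverse_transform`.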

In [5]:
# features
X = iris_data.drop(columns=["class"])
X.head()

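|   | sp_len | sp_wid | pt_len | pt_wid |
|---|--------|--------|--------|--------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |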


#### Split into validation and training data

I hold out 20% of the data for validation, which is a common choice. I set random_state = 1 for reproducibility.

In [6]:
train_X, val_X, train_y, val_y = train_test_split(X, encoded_labels, test_size=0.2, random_state=1)
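To make the split sizes concrete, here is a minimal sketch on synthetic stand-in arrays with the same shape as the Iris data (150 rows, 4 features, 3 balanced classes). The arrays and the `stratify` variant are assumptions added for illustration; the notebook itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the Iris shape: 150 rows, 4 features, 50 rows per class.
X_demo = np.arange(150 * 4).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

# Same arguments as the notebook cell: 20% held out, fixed seed.
train_X, val_X, train_y, val_y = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(train_X.shape, val_X.shape)  # (120, 4) (30, 4)

# Optional variant: stratify keeps the class proportions equal in both parts.
_, _, _, strat_val_y = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)
print(np.bincount(strat_val_y))  # [10 10 10]
```

With only 30 validation rows, a plain random split can leave the classes slightly unbalanced, which is why stratifying is sometimes preferred on small datasets.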

### Models

I chose three models that are known to work well on classification problems: Decision Tree, Random Forest and Categorical Naive Bayes. I trained each model, predicted on the validation data, and evaluated the predictions by calculating the mean absolute error (MAE) against the encoded labels.
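As a worked example of how MAE behaves on integer-encoded labels, the sketch below uses made-up predictions for 30 validation rows (an assumption for illustration, not the notebook's results) with a single mistake. Accuracy is computed alongside as a point of comparison; it is not part of the original evaluation.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up labels for 30 validation rows, 10 per class.
true_y = np.array([0] * 10 + [1] * 10 + [2] * 10)

# One mistake: a class-1 row predicted as class 2 (an error of one label step).
pred_y = true_y.copy()
pred_y[10] = 2

mae = mean_absolute_error(true_y, pred_y)  # |2 - 1| / 30 = 1/30
acc = accuracy_score(true_y, pred_y)       # 29 correct out of 30

print(round(mae, 4))  # 0.0333
print(round(acc, 4))  # 0.9667
```

One caveat of MAE in this setting is that it depends on the integer order assigned by the encoder: confusing class 0 with class 2 counts twice as much as confusing class 0 with class 1, even though both are a single misclassification.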

#### Model #1: Decision Tree

In [7]:
decision_tree_MAE = tt.decision_tree(train_X, val_X, train_y, val_y)
decision_tree_MAE

0.03333333333333333
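The `train_test` module is not shown in this notebook, so the following is only a guess at what a helper like `tt.decision_tree` might do: fit a tree on the training data, predict on the validation data, and return the MAE. The function name `decision_tree_mae_sketch`, the choice of `DecisionTreeClassifier`, and the use of scikit-learn's bundled copy of Iris are all assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def decision_tree_mae_sketch(train_X, val_X, train_y, val_y, random_state=1):
    """Hypothetical stand-in for tt.decision_tree: fit, predict, return MAE."""
    model = DecisionTreeClassifier(random_state=random_state)
    model.fit(train_X, train_y)
    predictions = model.predict(val_X)
    return mean_absolute_error(val_y, predictions)


# scikit-learn's bundled Iris data, used here instead of the notebook's CSV file.
X_demo, y_demo = load_iris(return_X_y=True)
train_X, val_X, train_y, val_y = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
mae = decision_tree_mae_sketch(train_X, val_X, train_y, val_y)
print(mae)
```

The value printed by this sketch need not match the notebook's output, since the real helper may use a different estimator or settings.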

#### Model #2: Random Forest

In [8]:
forest_MAE = tt.forest(train_X, val_X, train_y, val_y)
forest_MAE

0.036000000000000004

#### Model #3: Categorical Naive Bayes

In [9]:
bayes_MAE = tt.bayes(train_X, val_X, train_y, val_y)
bayes_MAE

0.03333333333333333

### Results

In [10]:
models = {"Decision Tree": decision_tree_MAE, "Random Forest": forest_MAE, "Categorical Naive Bayes": bayes_MAE}
print(f"The best model out of the ones I chose was {tt.find_smallest_key(models)}")

The best model out of the ones I chose was Decision Tree
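The selection step can be sketched as follows. The real `tt.find_smallest_key` is not shown, so `find_smallest_key_sketch` is a hypothetical equivalent built on Python's `min`; the scores are the MAE values reported above.

```python
def find_smallest_key_sketch(scores):
    """Hypothetical stand-in for tt.find_smallest_key.

    Returns the key with the smallest value. On a tie, min() keeps the
    first key in the dict's insertion order.
    """
    return min(scores, key=scores.get)


# MAE values copied from the outputs above.
models = {
    "Decision Tree": 0.03333333333333333,
    "Random Forest": 0.036000000000000004,
    "Categorical Naive Bayes": 0.03333333333333333,
}
best = find_smallest_key_sketch(models)
print(best)  # Decision Tree
```

Decision Tree and Categorical Naive Bayes have the same MAE here, so under this sketch Decision Tree is reported only because it appears first in the dictionary.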
