# Predicting Iris Species - Preprocessing

In [6]:
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()

The `iris` object loaded above is a dictionary-like container (a scikit-learn `Bunch`) whose entries include:
- DESCR: a description of the data including information about the attributes and target variable
- data: an array containing the attributes and their values
     - Sepal length
     - Sepal width
     - Petal length
     - Petal width
- target: an array containing the target variable, i.e., the Iris flower species
     - 0: Setosa
     - 1: Versicolor
     - 2: Virginica
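
The entries listed above can be inspected directly. This is a minimal sketch, assuming scikit-learn is installed; it also prints `feature_names` and `target_names`, two further entries of the same object that the cells below rely on.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The returned object is dictionary-like: entries are reachable as keys or attributes.
print(sorted(iris.keys()))

# 150 samples with 4 attributes each, and one integer label per sample.
print(iris.data.shape)    # (150, 4)
print(iris.target.shape)  # (150,)

# Attribute names (column order of `data`) and species names (index = label).
print(iris.feature_names)
print(list(iris.target_names))  # ['setosa', 'versicolor', 'virginica']
```

The exact set of keys varies between scikit-learn versions, so the first `print` is shown without an expected output.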

To make the dataset easier to explore, we build a pandas DataFrame that holds the four attributes together with the target variable. The following function does this:

In [7]:
def create_iris_df(iris_dataset):
    """Return a DataFrame with the four attributes and a species_num column."""
    # One column per attribute, named after the dataset's feature names
    iris_df = pd.DataFrame(
        iris_dataset.data,
        columns=iris_dataset.feature_names
    )

    # Append the numeric species label (0, 1 or 2) as the last column
    iris_df["species_num"] = iris_dataset.target

    return iris_df

In [8]:
iris_df = create_iris_df(iris)
iris_df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species_num |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
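
As an aside, a DataFrame of the same shape can probably be obtained without a helper function. The sketch below assumes scikit-learn 0.23 or newer, where `load_iris` accepts `as_frame=True`; the names `iris_bunch` and `iris_frame` are illustrative and not part of the notebook.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects instead of arrays.
iris_bunch = load_iris(as_frame=True)

# `frame` holds the four attribute columns plus a final "target" column,
# renamed here to match the species_num column used in this notebook.
iris_frame = iris_bunch.frame.rename(columns={"target": "species_num"})

print(iris_frame.shape)  # (150, 5)
print(list(iris_frame.columns))
```

Writing the function by hand keeps the notebook independent of the scikit-learn version, which may be why it is done that way here.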


In [13]:
iris_df_viz = iris_df.copy()

# Replace the numeric label with the species name for readable plots
iris_df_viz['species'] = iris_df_viz['species_num'].apply(lambda x: iris.target_names[x])
iris_df_viz = iris_df_viz.drop('species_num', axis=1)

iris_df_viz.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
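
The conversion from numeric labels to species names can also be written without a `lambda`. The sketch below uses small stand-in values (a hard-coded `target_names` list and a five-element `species_num` Series, both assumptions for illustration) rather than the full dataset, and shows two possible variants.

```python
import pandas as pd

# Stand-ins mirroring iris.target_names and a few species_num values.
target_names = ["setosa", "versicolor", "virginica"]
species_num = pd.Series([0, 0, 1, 2, 1])

# Variant 1: look each label up in a {label: name} dictionary.
label_to_name = dict(enumerate(target_names))
species = species_num.map(label_to_name)

# Variant 2: build a categorical column, which stores the labels once
# and keeps the integer codes underneath.
species_cat = pd.Categorical.from_codes(
    species_num.to_numpy(), categories=target_names
)

print(species.tolist())
# ['setosa', 'setosa', 'versicolor', 'virginica', 'versicolor']
print(list(species_cat) == species.tolist())  # True
```

The categorical variant may be preferable when the column is later used for grouping or plotting, since the category order is fixed explicitly.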


In [14]:
iris_df_viz.to_csv('data/iris-species.csv', index=False)
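
Note that `to_csv` does not create missing directories, so the `data/` folder has to exist before the cell above runs. The following sketch shows one way to guard against that and to check the written file by reading it back. It uses a temporary directory and a two-row stand-in DataFrame (`sample_df`, hypothetical) so that it does not touch the project's real `data/` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two-row stand-in for iris_df_viz.
sample_df = pd.DataFrame(
    {"sepal length (cm)": [5.1, 4.9], "species": ["setosa", "setosa"]}
)

# A temporary directory stands in for the project root.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "data" / "iris-species.csv"

    # Create data/ if it is missing; do nothing if it already exists.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False keeps the row index out of the file, so reading it back
    # does not produce an extra "Unnamed: 0" column.
    sample_df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))  # ['sepal length (cm)', 'species']
print(reloaded.shape)          # (2, 2)
```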