For this task we wish to create a model capable of classifying plant data from the Iris dataset :)

In [55]:
import os
import requests

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
%matplotlib inline

In [35]:
# This downloads the dataset, if it hasn't already been downloaded
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
data_name = 'iris.data'
data_path = os.path.join(os.getcwd(), data_name)

if not os.path.isfile(data_path):
    r = requests.get(data_url)
    r.raise_for_status()  # Fail early rather than saving an error page as the dataset
    with open(data_path, 'wb') as f:
        f.write(r.content)
    

We can see from a sample of our dataset that each data feature is a floating-point number, recorded to one decimal place, representing the length or width of the sepal or petal. An argument could be made that this data is discrete, since the measurements are rounded to a fixed precision. However, the values are ordered and the distances between them are meaningful: 4.3cm is far closer to 4.2cm than it is to 2.1cm, so the values are not separate, unrelated classes and should be treated as continuous.

We wish to use these measurements to assign each flower to one of three discrete classes, [Iris-setosa, Iris-versicolor, Iris-virginica], hence it's a classification problem.

In [62]:
# Each data feature is the length (_l) or width (_w) of the sepal or petal
data_features = ['sepal_l', 'sepal_w', 'petal_l', 'petal_w']
df = pd.read_csv(data_path, names = data_features + ['class'])
print(df.sample())

    sepal_l  sepal_w  petal_l  petal_w            class
78      6.0      2.9      4.5      1.5  Iris-versicolor
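
As a quick sanity check on the continuous-features, discrete-target reading, one could inspect the column types and the class counts. The sketch below is self-contained and uses three hand-typed rows in the same comma-separated layout as the file (the rows and the names `sample_csv` and `sample_df` are illustrative assumptions, not part of the notebook); on the real data the same two calls would be applied to `df`.

```python
import io

import pandas as pd

# Three illustrative rows in the same layout as iris.data (assumed sample, not the full file)
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "6.0,2.9,4.5,1.5,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

columns = ['sepal_l', 'sepal_w', 'petal_l', 'petal_w', 'class']
sample_df = pd.read_csv(sample_csv, names=columns)

# The four measurement columns should parse as floats (continuous features)
feature_dtypes = sample_df[columns[:-1]].dtypes
print(feature_dtypes)

# The target column should hold a small number of distinct labels (discrete classes)
class_counts = sample_df['class'].value_counts()
print(class_counts)
```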


Here we transform each class name into an integer class index (a label encoding rather than a one-hot vector). We then shuffle the data and split it into a training set (80%) and a test set (20%).

In [None]:
# Finds the unique classes; sorting keeps the index assignment stable across runs
to_class = sorted(set(df['class']))
# Gets an index from the class name
to_idx = {clas : idx for idx, clas in enumerate(to_class)}

# Transform class name to integer class index
df['class'] = df['class'].apply(lambda clas: to_idx[clas])

# Shuffles data and splits into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(df[data_features], df['class'], test_size=0.2)
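
The cell above stores each class as an integer index. If one-hot targets are wanted later (for example for a softmax output), one possible approach is to index an identity matrix with those integers. The sketch below also passes `stratify` and `random_state` to `train_test_split` so that the class balance is preserved and the split is reproducible. It runs on hypothetical stand-in arrays (`X`, `y`) with the same shape as the Iris data rather than on `df` itself, and it is only one way to do this, not necessarily what the rest of the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 150 rows, 4 features, 3 balanced classes
rng = np.random.default_rng(0)
X = rng.random((150, 4))
y = np.repeat([0, 1, 2], 50)

# stratify keeps the class proportions equal in both splits;
# random_state makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Row i of the identity matrix is the one-hot vector for class i
n_classes = 3
y_train_one_hot = np.eye(n_classes)[y_train]

print(X_train.shape, X_test.shape)
print(y_train_one_hot[:3])
```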