# Classifying Iris Species

In [None]:
from preamble import *

In [4]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

### Investigate Dataset

In [5]:
print(f"Keys of the iris_dataset: \n{iris_dataset.keys()}")

Keys of the iris_dataset: 
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])


In [6]:
print(iris_dataset['DESCR'][:500] + "\n...")

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                

...


In [7]:
print(f"Target names: {iris_dataset['target_names']}")

Target names: ['setosa' 'versicolor' 'virginica']


In [8]:
print(f"Feature names: {iris_dataset['feature_names']}")

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [9]:
print(f"Type of data: {type(iris_dataset['data'])}")

Type of data: <class 'numpy.ndarray'>


In [10]:
print(f"Shape of data: {iris_dataset['data'].shape}")

Shape of data: (150, 4)


The array contains measurements for 150 different flowers. Individual items are called *samples* in machine learning, while their properties are called *features*. The *shape* of the data array is the number of samples by the number of features, here 150 by 4.
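
As a small illustration of the samples-by-features layout, the sketch below unpacks the shape and pulls out one row and one column. The variable names `n_samples`, `n_features`, `first_flower`, and `sepal_lengths` are introduced here for illustration only.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Rows are samples (flowers), columns are features (measurements).
n_samples, n_features = iris_dataset['data'].shape
print(f"{n_samples} samples, {n_features} features")

# One row holds all measurements of a single flower;
# one column holds a single feature across all flowers.
first_flower = iris_dataset['data'][0]
sepal_lengths = iris_dataset['data'][:, 0]
print(first_flower.shape, sepal_lengths.shape)
```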

In [11]:
print(f"First five rows of data: \n {iris_dataset['data'][:5]}")

First five rows of data: 
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


From this data, we can see that all of the first five flowers have a petal width of 0.2 cm and that the first flower has the longest sepal of the five, at 5.1 cm.
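
These two observations can also be checked programmatically rather than by eye. The following is a minimal sketch; looking up the column by its name in `feature_names` is just one way to do it, and the names `petal_width` and `sepal_length` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
first_five = iris_dataset['data'][:5]
feature_names = iris_dataset['feature_names']

# Select columns by feature name instead of hard-coding the index.
petal_width = first_five[:, feature_names.index('petal width (cm)')]
sepal_length = first_five[:, feature_names.index('sepal length (cm)')]

# Do all five flowers share a petal width of 0.2 cm?
print(np.allclose(petal_width, 0.2))
# Which of the five flowers has the longest sepal, and how long is it?
print(sepal_length.argmax(), sepal_length.max())
```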

The *target* array contains the species of each of the flowers that were measured, also as a NumPy array:

In [12]:
print(f"Type of target: {type(iris_dataset['target'])}")

Type of target: <class 'numpy.ndarray'>


In [13]:
print(f"Shape of target: {iris_dataset['target'].shape}")

Shape of target: (150,)


The species are encoded as integers from 0 to 2:

In [14]:
print(f"Target:\n{iris_dataset['target']}")

Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


The meanings of the numbers are given by the `iris_dataset['target_names']` array:
0 means *setosa*, 1 means *versicolor*, and 2 means *virginica*.
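
Because `target_names` is a NumPy array, the integer labels can be used directly as indices to recover the species name of every flower. A short sketch, with `species` as an illustrative variable name:

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
target = iris_dataset['target']
target_names = iris_dataset['target_names']

# Indexing the names array with the integer labels gives one name per flower.
species = target_names[target]
print(species[[0, 50, 100]])

# Count how many flowers belong to each class.
for name, count in zip(target_names, np.bincount(target)):
    print(f"{name}: {count}")
```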

### Measuring Success: Training and Testing Data
We cannot evaluate the model on the same data we used to build it. A model could simply memorize the whole training set and would then predict the correct label for every point in it. Such memorization does not indicate that our model will *generalize* well. To assess the model's performance, we show it new data that it has not seen before and for which we know the labels.

> Split data into *train* and *test* sets

Data is usually denoted with a capital *X*, while labels are denoted with a lowercase *y*. *X* is capitalized because it is a two-dimensional array (a matrix), while *y* is a one-dimensional array (a vector).

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

In [16]:
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")

X_train shape: (112, 4)
y_train shape: (112,)


In [17]:
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

X_test shape: (38, 4)
y_test shape: (38,)
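
The shapes above follow from scikit-learn's defaults: when `test_size` is not given, `train_test_split` holds out 25% of the rows, and it shuffles the data before splitting. The sketch below inspects the result of the same call; `test_fraction` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

# X is two-dimensional (a matrix), y is one-dimensional (a vector).
print(X_train.ndim, y_train.ndim)

# With no test_size given, about a quarter of the rows are held out.
test_fraction = X_test.shape[0] / iris_dataset['data'].shape[0]
print(f"Test fraction: {test_fraction:.2f}")

# The rows are shuffled before splitting; count the classes in each set.
print(np.bincount(y_train), np.bincount(y_test))
```

Shuffling matters here because the original targets are sorted by species; taking the last quarter of the rows without shuffling would produce a test set containing only class 2.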


### Look at the Data

In [18]:
# explicit imports, so the cell does not depend on the preamble having run
import pandas as pd
import mglearn  # third-party helper package providing the cm3 colormap

# create a dataframe from the data in X_train
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# create a scatter matrix from the dataframe, color by y_train
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15,15),
                                 marker='o', hist_kwds={'bins':20}, s=60,
                                 alpha=.8, cmap=mglearn.cm3)