# Associate flowers with their classes

In this activity, you'll try to predict a flower's class given some of its characteristics.

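[Iris](https://archive.ics.uci.edu/ml/datasets/iris) is a well-known multiclass dataset. It contains 3 classes of flowers (setosa, versicolor and virginica) with 50 examples each. Each flower is described by 4 features: sepal length, sepal width, petal length and petal width, all in centimeters.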

![Flower](images/Iris-versicolor-21_1.jpg)

## Environment setup

In [1]:
# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [2]:
# Set up plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 10
%config InlineBackend.figure_format = 'retina'
sns.set()

In [3]:
# Import ML packages
import sklearn
print(f'scikit-learn version: {sklearn.__version__}')

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

scikit-learn version: 0.23.2


## Step 1: Loading the data

In [4]:
# Load the Iris dataset included with scikit-learn
dataset = load_iris()

# Put data in a pandas DataFrame
df_iris = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target and class to DataFrame
df_iris['target'] = dataset.target
df_iris['class'] = dataset.target_names[dataset.target]
# Show 10 random samples
df_iris.sample(n=10)

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | class |
|---|---|---|---|---|---|---|
| 96 | 5.7 | 2.9 | 4.2 | 1.3 | 1 | versicolor |
| 127 | 6.1 | 3.0 | 4.9 | 1.8 | 2 | virginica |
| 129 | 7.2 | 3.0 | 5.8 | 1.6 | 2 | virginica |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | 0 | setosa |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | virginica |
| 124 | 6.7 | 3.3 | 5.7 | 2.1 | 2 | virginica |
| 50 | 7.0 | 3.2 | 4.7 | 1.4 | 1 | versicolor |
| 110 | 6.5 | 3.2 | 5.1 | 2.0 | 2 | virginica |
| 48 | 5.3 | 3.7 | 1.5 | 0.2 | 0 | setosa |
| 101 | 5.8 | 2.7 | 5.1 | 1.9 | 2 | virginica |


### Question

Store the input data (the 4 features of every flower) in a variable named `x` and the corresponding targets in a variable named `y`.

In [5]:
# BEGIN SOLUTION CODE
x = dataset.data
y = dataset.target
# END SOLUTION CODE

In [6]:
print(f'x: {x.shape}. y: {y.shape}')
print(f'Labels: {y}')

assert x.shape == (150,4)
assert y.shape == (150,)

x: (150, 4). y: (150,)
Labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
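
The label printout above suggests that the classes are balanced and stored in order. One way to confirm this is to count the samples per class. The sketch below does so with `np.bincount`; the variable name `class_counts` is an illustrative choice, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

dataset = load_iris()

# Count how many samples carry each integer label (0, 1, 2)
class_counts = np.bincount(dataset.target)

for name, count in zip(dataset.target_names, class_counts):
    print(f'{name}: {count} samples')
```

Because the samples are sorted by class, a split that did not shuffle them would put a single class in the test set. `train_test_split` shuffles by default, which avoids this.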


## Step 2: Preparing the data

### Question

Split data and labels into the `x_train`, `x_test`, `y_train` and `y_test` variables, keeping 20% of the samples for the test set.

In [7]:
# BEGIN SOLUTION CODE
# Split data between training and test sets with a 20% ratio
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2)
# END SOLUTION CODE

In [8]:
print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')

assert x_train.shape == (120, 4)
assert y_train.shape == (120,)
assert x_test.shape == (30, 4)
assert y_test.shape == (30,)

x_train: (120, 4). y_train: (120,)
x_test: (30, 4). y_test: (30,)
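
The split above is random, so the accuracies reported in the next steps change from one run to the next, and the three classes are not guaranteed to be equally represented in the test set. A possible variant, given here as a sketch and not as part of the original solution, fixes the seed with `random_state` and preserves class proportions with `stratify`. The seed value 42 and the `_s` suffixed names are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

# stratify=y keeps the 50/50/50 class balance in both subsets;
# random_state makes the shuffle reproducible
x_train_s, x_test_s, y_train_s, y_test_s = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42
)

print(f'Training set class counts: {np.bincount(y_train_s)}')
print(f'Test set class counts: {np.bincount(y_test_s)}')
```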


## Step 3: Training a model

### Question
 
Create a `SGDClassifier` instance and store it in the `model` variable. Fit this model on the training data.

In [9]:
# Train a multiclass classifier on the data

# BEGIN SOLUTION CODE
# loss='log' selects logistic regression; scikit-learn 1.1+ names this loss 'log_loss'
model = SGDClassifier(loss='log')
model.fit(x_train, y_train)
# END SOLUTION CODE

SGDClassifier(loss='log')

## Step 4: Evaluating the model

In [10]:
# Compute accuracy on training and test sets
train_acc = model.score(x_train, y_train)
test_acc = model.score(x_test, y_test)

print(f'Training accuracy: {train_acc * 100:.2f}%')
print(f'Test accuracy: {test_acc * 100:.2f}%')

Training accuracy: 75.00%
Test accuracy: 73.33%
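
For a classifier, `score` returns the mean accuracy, that is, the fraction of samples whose predicted label matches the true one. The sketch below reproduces that computation with NumPy on made-up labels, which are not taken from the run above.

```python
import numpy as np

# Made-up labels for illustration only
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 0, 1, 2, 2, 2, 1, 1])

# Accuracy = number of correct predictions / number of samples
accuracy = np.mean(y_pred == y_true)
print(f'Accuracy: {accuracy * 100:.2f}%')
```

Here 6 of the 8 predictions are correct, so the accuracy is 75%.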


## Step 5: Improving the results

### Question

Try your best to improve the model's results. You should obtain accuracies of around 90% on both the training and test sets.

In [11]:
# BEGIN SOLUTION CODE
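# Standardize features; the scaler is fitted on the training set only
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

# Retrain the model from scratch on the standardized data
model.fit(x_train, y_train)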
# END SOLUTION CODE

SGDClassifier(loss='log')
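
`StandardScaler` subtracts each feature's mean and divides by its standard deviation, with both statistics estimated on the data passed to `fit`. The sketch below reproduces this with plain NumPy on made-up numbers. The names `x_tr`, `x_te`, `mean` and `std` are illustrative only.

```python
import numpy as np

# Made-up training and test matrices: 2 features on very different scales
x_tr = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
x_te = np.array([[2.0, 200.0]])

# Statistics are estimated on the training set only
mean = x_tr.mean(axis=0)
std = x_tr.std(axis=0)

# The same transformation is then applied to both sets
x_tr_scaled = (x_tr - mean) / std
x_te_scaled = (x_te - mean) / std

print(x_tr_scaled.mean(axis=0))  # close to 0 for every feature
print(x_tr_scaled.std(axis=0))   # close to 1 for every feature
```

Reusing the training statistics on the test set keeps information about the test samples out of the preprocessing step. Stochastic gradient descent is sensitive to feature scales, which is a plausible reason for the accuracy gain observed in this step.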

In [12]:
train_acc = model.score(x_train, y_train)
test_acc = model.score(x_test, y_test)

print(f'Training accuracy: {train_acc * 100:.2f}%')
print(f'Test accuracy: {test_acc * 100:.2f}%')

assert train_acc > .85
assert test_acc > .85

Training accuracy: 95.83%
Test accuracy: 96.67%
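
The scaling and training steps can also be chained into a single estimator, so that the scaler is always fitted on the same data as the model. The sketch below is one possible way to do this with `make_pipeline`. It is not the original solution, and the seeds and the names `pipeline` and `log_loss_name` are arbitrary choices. The version check is there because the logistic loss is called `'log'` in the scikit-learn 0.23.2 used above and `'log_loss'` from version 1.1 on.

```python
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Pick the name of the logistic loss according to the installed version
major, minor = (int(part) for part in sklearn.__version__.split('.')[:2])
log_loss_name = 'log_loss' if (major, minor) >= (1, 1) else 'log'

x, y = load_iris(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0
)

# fit() standardizes x_tr, then trains the classifier on the scaled data;
# score() and predict() apply the same scaling before calling the classifier
pipeline = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss=log_loss_name, random_state=0),
)
pipeline.fit(x_tr, y_tr)

print(f'Training accuracy: {pipeline.score(x_tr, y_tr) * 100:.2f}%')
print(f'Test accuracy: {pipeline.score(x_te, y_te) * 100:.2f}%')
```

With this design the raw `x_tr` and `x_te` arrays are never overwritten, so they remain available for trying other preprocessing steps or models.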
