
<a href="https://colab.research.google.com/github/bpesquet/machine-learning-katas/blob/master/notebooks/training_models/iris.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>


## Instructions

This is a self-correcting notebook generated by [nbgrader](https://github.com/jupyter/nbgrader). 

Fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`. Run subsequent cells to check your code.

# Associate Flowers With Their Class Using Logistic Regression

In this notebook, you'll try to predict a flower's class given some of its characteristics.

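[Iris](https://archive.ics.uci.edu/ml/datasets/iris) is a well-known multiclass dataset. It contains 3 classes of flowers (setosa, versicolor and virginica) with 50 examples each. Each flower is described by 4 features: sepal length, sepal width, petal length and petal width, all in centimeters.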
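
The figures above can be checked directly against the copy of the dataset bundled with scikit-learn. This is a minimal sketch; the `class_counts` name is only illustrative and is not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

dataset = load_iris()

# np.bincount tallies how many times each integer label (0, 1, 2) occurs
class_counts = {
    str(name): int(count)
    for name, count in zip(dataset.target_names, np.bincount(dataset.target))
}

print(f'Data shape: {dataset.data.shape}')
print(f'Feature names: {dataset.feature_names}')
print(f'Examples per class: {class_counts}')
```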

![Flower](images/Iris-versicolor-21_1.jpg)

## Package setup

In [1]:
# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [2]:
# Set up plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 10
%config InlineBackend.figure_format = 'retina'
sns.set()

In [3]:
# Import ML packages
import sklearn
print(f'scikit-learn version: {sklearn.__version__}')

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

scikit-learn version: 0.22.2.post1


## Step 1: Loading the data

In [4]:
# Load the Iris dataset included with scikit-learn
dataset = load_iris()

# Put data in a pandas DataFrame
df_iris = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target and class to DataFrame
df_iris['target'] = dataset.target
df_iris['class'] = dataset.target_names[dataset.target]
# Show 10 random samples
df_iris.sample(n=10)

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | class |
|---|---|---|---|---|---|---|
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | 2 | virginica |
| 98 | 5.1 | 2.5 | 3.0 | 1.1 | 1 | versicolor |
| 143 | 6.8 | 3.2 | 5.9 | 2.3 | 2 | virginica |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | 0 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | 0 | setosa |
| 86 | 6.7 | 3.1 | 4.7 | 1.5 | 1 | versicolor |
| 113 | 5.7 | 2.5 | 5.0 | 2.0 | 2 | virginica |
| 85 | 6.0 | 3.4 | 4.5 | 1.6 | 1 | versicolor |
| 144 | 6.7 | 3.3 | 5.7 | 2.5 | 2 | virginica |


### Question

Store training input data in a variable named `x` and training targets in a variable named `y`.

In [7]:
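# Use the class names as training targets
y = df_iris['class']
# Use the four feature columns (everything except 'target' and 'class') as inputs
x = df_iris.iloc[:, 0:-2]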

In [8]:
print(f'x: {x.shape}. y: {y.shape}')
print(f'Labels: {y}')

assert x.shape == (150,4)
assert y.shape == (150,)

x: (150, 4). y: (150,)
Labels: 0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
5         setosa
6         setosa
7         setosa
8         setosa
9         setosa
10        setosa
11        setosa
12        setosa
13        setosa
14        setosa
15        setosa
16        setosa
17        setosa
18        setosa
19        setosa
20        setosa
21        setosa
22        setosa
23        setosa
24        setosa
25        setosa
26        setosa
27        setosa
28        setosa
29        setosa
         ...    
120    virginica
121    virginica
122    virginica
123    virginica
124    virginica
125    virginica
126    virginica
127    virginica
128    virginica
129    virginica
130    virginica
131    virginica
132    virginica
133    virginica
134    virginica
135    virginica
136    virginica
137    virginica
138    virginica
139    virginica
140    virginica
141    virginica
142    virginica
143    virginica
144    virginica
145    virginica
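
The positional slice `iloc[:, 0:-2]` relies on `target` and `class` being the last two columns. As a hedged alternative, the features can also be selected by name, which keeps working if columns are later reordered. The sketch below rebuilds the DataFrame so it runs on its own; `x_by_name` and `x_by_position` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris

dataset = load_iris()
df_iris = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df_iris['target'] = dataset.target
df_iris['class'] = dataset.target_names[dataset.target]

# Select the four feature columns by name rather than by position
x_by_name = df_iris[dataset.feature_names]
# Positional selection: every column except the last two
x_by_position = df_iris.iloc[:, 0:-2]

# Both selections should contain exactly the same data
print(x_by_name.equals(x_by_position))
```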


## Step 2: Preparing the data

### Question

Split data and labels into the `x_train`, `x_test`, `y_train` and `y_test` variables, keeping 20% of the samples for the test set.

In [11]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [12]:
print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')

assert x_train.shape == (120, 4)
assert y_train.shape == (120,)
assert x_test.shape == (30, 4)
assert y_test.shape == (30,)

x_train: (120, 4). y_train: (120,)
x_test: (30, 4). y_test: (30,)
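
A plain `train_test_split` shuffles the samples differently on every run, so the class balance of the 30 test samples and the accuracies measured later can vary. The sketch below shows one possible way to make the split reproducible and balanced, using the `stratify` and `random_state` parameters of `train_test_split`. The seed value 42 is an arbitrary assumption, and the integer targets are used here for brevity.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions identical in both subsets;
# random_state fixes the shuffle so the split is the same on every run
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42)

test_class_counts = Counter(y_test.tolist())
print(f'x_train: {x_train.shape}. x_test: {x_test.shape}')
print(f'Test samples per class: {dict(test_class_counts)}')
```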


## Step 3: Training a model

### Question
 
Create an `SGDClassifier` instance and store it in the `model` variable. Fit this model on the training data.

In [13]:
# Train a multiclass classifier on the data

model = SGDClassifier()
model.fit(x_train, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

## Step 4: Evaluating the model

In [14]:
# Compute accuracy on training and test sets
train_acc = model.score(x_train, y_train)
test_acc = model.score(x_test, y_test)

print(f'Training accuracy: {train_acc * 100:.2f}%')
print(f'Test accuracy: {test_acc * 100:.2f}%')

Training accuracy: 82.50%
Test accuracy: 73.33%
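
For a classifier, `model.score` returns the accuracy: the fraction of samples whose predicted class matches the true class. The sketch below recomputes that value by hand to make the definition concrete. It retrains a model so that it runs on its own, and the `random_state` values are arbitrary assumptions, so the numbers it prints will differ from the ones above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

model = SGDClassifier(random_state=0)
model.fit(x_train, y_train)

# Accuracy computed by hand: mean of the element-wise comparison
manual_acc = np.mean(model.predict(x_test) == y_test)
# Accuracy as reported by scikit-learn
score_acc = model.score(x_test, y_test)

print(f'Manual accuracy: {manual_acc * 100:.2f}%')
print(f'model.score accuracy: {score_acc * 100:.2f}%')
```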


## Step 5: Improve results

### Question

Try your best to improve the model's results. You should obtain accuracies of around 90%.

In [15]:
# Same settings as the defaults shown above, except max_iter (raised from 1000 to 10000)
model = SGDClassifier(
    loss='hinge',
    penalty='l2',
    alpha=0.0001,
    l1_ratio=0.15,
    fit_intercept=True,
    max_iter=10000,
    tol=0.001,
    shuffle=True,
    verbose=0,
    epsilon=0.1,
    n_jobs=None,
    random_state=None,
    learning_rate='optimal',
    eta0=0.0,
    power_t=0.5,
    early_stopping=False,
    validation_fraction=0.1,
    n_iter_no_change=5,
    class_weight=None,
    warm_start=False,
    average=False,
)
model.fit(x_train, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=10000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [16]:
train_acc = model.score(x_train, y_train)
test_acc = model.score(x_test, y_test)

print(f'Training accuracy: {train_acc * 100:.2f}%')
print(f'Test accuracy: {test_acc * 100:.2f}%')

assert train_acc > .85
assert test_acc > .85

Training accuracy: 94.17%
Test accuracy: 96.67%
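
Because neither the split nor the classifier uses a fixed `random_state`, the accuracies above change from run to run. `StandardScaler` is imported in the package setup but never used, and stochastic gradient descent is generally sensitive to feature scale. The sketch below is one possible approach rather than the notebook's reference solution: it standardizes the features inside a pipeline and keeps the default `hinge` loss. The `scaled_model` name and the seed values are assumptions, and the resulting accuracies still depend on the split.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0)

# The scaler is fitted on the training data only, then applied
# automatically to any data passed to predict() or score()
scaled_model = make_pipeline(
    StandardScaler(),
    SGDClassifier(max_iter=10000, random_state=0),
)
scaled_model.fit(x_train, y_train)

train_acc = scaled_model.score(x_train, y_train)
test_acc = scaled_model.score(x_test, y_test)

print(f'Training accuracy: {train_acc * 100:.2f}%')
print(f'Test accuracy: {test_acc * 100:.2f}%')
```

Wrapping both steps in a pipeline avoids leaking test-set statistics into the scaler, since `fit` only ever sees the training data.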
