
<a href="https://colab.research.google.com/github/bpesquet/machine-learning-katas/blob/master/notebooks/katas/algorithms/LogisticRegression_Iris.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>


## Instructions

This is a self-correcting exercise generated by [nbgrader](https://github.com/jupyter/nbgrader). 

Fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`. Run subsequent cells to check your code.

---

# Kata: Associate Flowers With Their Class with Logistic Regression

In this kata, you'll try to predict a flower's class given some of its characteristics.

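[Iris](https://archive.ics.uci.edu/ml/datasets/iris) is a well-known multiclass dataset. It contains 3 classes of flowers (setosa, versicolor and virginica) with 50 examples each. Each flower is described by 4 features: sepal length, sepal width, petal length and petal width, all in centimeters.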

![](images/Iris-versicolor-21_1.jpg)

## Package setup

In [54]:
# The mlkatas package contains various utility functions required by all katas
!pip install mlkatas



In [15]:
# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import mlkatas

In [16]:
# Setup plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 10
%config InlineBackend.figure_format = 'retina'
sns.set()

In [17]:
# Import ML packages (edit this list if needed)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier


## Step 1: Loading the data

In [18]:
# Load the Iris dataset included with scikit-learn
dataset = load_iris()

# Put data in a pandas DataFrame
df_iris = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target and class to DataFrame
df_iris['target'] = dataset.target
df_iris['class'] = dataset.target_names[dataset.target]
# Show 10 random samples
df_iris.sample(n=10)

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | class |
|---|---|---|---|---|---|---|
| 54 | 6.5 | 2.8 | 4.6 | 1.5 | 1 | versicolor |
| 123 | 6.3 | 2.7 | 4.9 | 1.8 | 2 | virginica |
| 24 | 4.8 | 3.4 | 1.9 | 0.2 | 0 | setosa |
| 101 | 5.8 | 2.7 | 5.1 | 1.9 | 2 | virginica |
| 45 | 4.8 | 3.0 | 1.4 | 0.3 | 0 | setosa |
| 131 | 7.9 | 3.8 | 6.4 | 2.0 | 2 | virginica |
| 64 | 5.6 | 2.9 | 3.6 | 1.3 | 1 | versicolor |
| 110 | 6.5 | 3.2 | 5.1 | 2.0 | 2 | virginica |
| 78 | 6.0 | 2.9 | 4.5 | 1.5 | 1 | versicolor |
| 103 | 6.3 | 2.9 | 5.6 | 1.8 | 2 | virginica |


### Question

Store the input data in a variable named `x` and the targets in a variable named `y`.

In [19]:
x, y = dataset.data, dataset.target

In [20]:
print(f'x: {x.shape}. y: {y.shape}')
print(f'Labels: {y}')

assert x.shape == (150,4)
assert y.shape == (150,)

x: (150, 4). y: (150,)
Labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


## Step 2: Preparing the data

### Question

Split data and labels into the `x_train`, `x_test`, `y_train` and `y_test` variables, holding out 20% of the samples for the test set.

In [21]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [22]:
print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')

assert x_train.shape == (120, 4)
assert y_train.shape == (120,)
assert x_test.shape == (30, 4)
assert y_test.shape == (30,)

x_train: (120, 4). y_train: (120,)
x_test: (30, 4). y_test: (30,)
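
Because `train_test_split` shuffles randomly, the split changes on every run, and a 30-sample test set can end up with uneven class counts. The following sketch shows one possible variation that is not part of the original kata: it fixes the shuffle with `random_state` (the seed 42 is an arbitrary choice) and passes `stratify=y` so that each class keeps the same proportion in both subsets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

# stratify=y preserves the class proportions in both subsets;
# random_state=42 is an arbitrary seed used only to make the split repeatable
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42
)

# Count the samples of each class in both subsets
print(np.bincount(y_train))  # [40 40 40]
print(np.bincount(y_test))   # [10 10 10]
```

The shapes are unchanged, so the assertions of this step still pass with such a split.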


## Step 3: Training a model

### Question
 
Create a `SGDClassifier` instance and store it into the `model` variable. Fit this model on the training data.

In [23]:
# Train a multiclass classifier on the data
# The "log" loss gives logistic regression; scikit-learn 1.1 and later name it "log_loss"
model = SGDClassifier(loss="log")
model.fit(x_train, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=1000,
              n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5,
              random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

## Step 4: Evaluating the model

In [45]:
# Compute accuracy on training and test sets
train_acc = model.score(x_train, y_train)
test_acc = model.score(x_test, y_test)

print(f'Training accuracy: {train_acc * 100:.2f}%')
print(f'Test accuracy: {test_acc * 100:.2f}%')

Training accuracy: 92.50%
Test accuracy: 76.67%


## Step 5: Improve results

### Question

Try your best to improve the model's results. You should obtain training and test accuracies of around 90%.

In [60]:
model = SGDClassifier(loss="log", penalty="l1")
model.fit(x_train, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=1000,
              n_iter_no_change=5, n_jobs=None, penalty='l1', power_t=0.5,
              random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [61]:
train_acc = model.score(x_train, y_train)
test_acc = model.score(x_test, y_test)

print(f'Training accuracy: {train_acc * 100:.2f}%')
print(f'Test accuracy: {test_acc * 100:.2f}%')

assert train_acc > .85
assert test_acc > .85

Training accuracy: 95.83%
Test accuracy: 93.33%
