<a href="https://colab.research.google.com/github/bpesquet/machine-learning-katas/blob/master/classic-datasets/Iris.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>

# Guided ML With The Iris Dataset

| Learning type | Activity type | Objective |
| - | - | - |
| Supervised | Multiclass classification | Identify a flower's class |

Contents:
1. Loading the data
2. Setting up supervised learning problem (selecting features)
3. Creating a first model
    - Creating train and test datasets
    - Normalizing train and test
    - Fitting and predicting
4. Evaluating the first model's predictions
5. Cross-validation of the model
6. Creating an end-to-end ML pipeline
    - Train/Test Split
    - Normalize
    - Cross-validation
    - Model
    - Fitting and predicting

## Instructions with NBGrader removed

Complete the cells beginning with `# YOUR CODE HERE` and run the subsequent cells to check your code.

## About the dataset

[Iris](https://archive.ics.uci.edu/ml/datasets/iris) is a well-known multiclass dataset. It contains 3 classes of flowers with 50 examples each. There are a total of 4 features for each flower.

![](./classic-datasets/images/Iris-versicolor-21_1.jpg)

## Package setups

1. Run the following two cells to initialize the required libraries.

In [2]:
#to debug package errors
import sys
sys.path
sys.executable

'/home/roy/.conda/envs/MachineLearning/bin/python'

In [38]:
# Import needed packages
# You may add or remove packages should you need them
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import datasets
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Set random seed
np.random.seed(0)

# Display plots inline and change plot resolution to retina
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Set Seaborn aesthetic parameters to defaults
sns.set()

## Step 1: Loading the data

1. Load the iris dataset using ```datasets.load_iris()```
2. Investigate the data structure with ```.keys()```
3. Construct a dataframe from the dataset
4. Create a 'target' and a 'class' column that contains the target names and values
5. Display a random sample of the dataframe 

In [4]:
iris = datasets.load_iris()
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [5]:
# Print the length of each entry; entries without a length (e.g. None) are printed as-is
for k in iris.keys():
    try:
        print(len(iris[k]), k)
    except TypeError:
        print(iris[k], k)

150 data
150 target
None frame
3 target_names
2782 DESCR
4 feature_names
96 filename


In [6]:
def classify(num):
    if num == 0:
        return 'setosa'
    elif num == 1:
        return 'versicolor'
    elif num == 2:
        return 'virginica'

In [7]:
df = pd.DataFrame(iris['data'])
df['target'] = iris['target']
df['class'] = [classify(x) for x in iris['target']]
df

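| | 0 | 1 | 2 | 3 | target | class |
| - | - | - | - | - | - | - |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | setosa |
| ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 | virginica |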
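The dataframe above has columns labelled 0 to 3. As a hedged alternative, the sketch below builds the same table with the dataset's own feature names as column labels and derives the class names by indexing `target_names` with the integer targets. The name `df_named` is introduced here for illustration only and is not used elsewhere in the notebook.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Use the dataset's own feature names as column labels instead of 0..3
df_named = pd.DataFrame(iris["data"], columns=iris["feature_names"])
df_named["target"] = iris["target"]
# Index the array of class names with the integer targets to get one label per row
df_named["class"] = iris["target_names"][iris["target"]]

# A fixed random_state makes the displayed sample repeatable
print(df_named.sample(5, random_state=0))
```

Named columns make later feature selection easier to read, at the cost of slightly longer column labels.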


### Question
Find the X and y values we're looking for. Notice that y is categorical, so we could **one-hot encode it** if we use **class**, or we can simply pick **target**. To one-hot encode, we would convert `y` into indicator columns with the pandas **get_dummies** function.
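For reference only, here is a minimal sketch of what one-hot encoding the class labels with `pd.get_dummies` might look like. The variable names `class_names` and `y_onehot` are assumptions made for this illustration.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
# One string label per row, e.g. 'setosa'
class_names = pd.Series(iris["target_names"][iris["target"]], name="class")

# One indicator column per class; each row has exactly one entry set
y_onehot = pd.get_dummies(class_names)

print(y_onehot.shape)
print(list(y_onehot.columns))
```

The result has one column per class instead of a single label column, which is why it changes the shape of `y`.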

#### For the purpose of this exercise, do not use one-hot encoding. Use only target, but think about whether you have to drop it somewhere or not...

In [140]:
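# YOUR CODE HERE
# Labels: this solution keeps the string class names (scikit-learn classifiers accept them)
y = df['class']
# Features: drop both label columns so the model cannot see the answer
X = df.drop(['target', 'class'], axis=1)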

## Step 2: Setting up supervised learning problem (selecting features)

Feature selection is an essential step in improving a model's performance. In the first version of the model we will use the 'sepal length' and 'sepal width' as predicting features. Later we will see the effect of adding additional features.

1. Assign the values of the 'target' to Y as a numpy array
2. Assign the remaining feature values to X as a numpy array
3. Check the shape of X and Y. Check the first few values.
    - Can we confirm our X and Y are created correctly?
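As an illustration of the feature selection described above, this sketch keeps only the two sepal columns and checks the resulting shapes. The names `sepal_idx`, `X_sepal` and `y_target` are hypothetical and chosen for this example.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Find the column positions whose feature name starts with "sepal"
sepal_idx = [i for i, name in enumerate(iris["feature_names"]) if name.startswith("sepal")]
X_sepal = iris["data"][:, sepal_idx]
y_target = iris["target"]

# Sanity checks: one row per flower, two features, one label per row
print(sepal_idx, X_sepal.shape, y_target.shape)
print(X_sepal[:3])
```

Comparing the first rows of `X_sepal` with the first rows of the dataframe is a quick way to confirm that the right columns were kept.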

In [141]:
y = np.array(y)
X = np.array(X)
print(y[:10])
y.shape

['setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa']


(150,)

In [142]:
print(X[:10])
X.shape

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]


(150, 4)

## Step 3: Creating the first model

In lecture we learned about creating a train and test datasets, normalizing, and fitting a model. In this step we will see how to build a simple version of this.

We have to be careful when constructing our train and test datasets. First, we have to make sure that each set contains the same datapoints every time we run the split. Otherwise our results won't be reproducible, or we might introduce a bias into our model.

We also need to be attentive to when we normalize the data. What would be the effect of normalizing the data (e.g. with `StandardScaler`, which rescales each feature to zero mean and unit variance) before we create our train and test sets? Effectively we would use information in the test set to structure the values in the training set and vice versa. Therefore the preferred method is to split first, fit the scaler on the training set only, and then apply that same fitted transform to the test set.
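One common way to keep test-set information out of the training data is to compute the scaling statistics from the training rows only and reuse them for the test rows. The sketch below assumes an 80/20 split and uses illustrative variable names (`X_tr`, `X_te`, `X_tr_s`, `X_te_s`) that are not part of the exercise.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=0)

# Learn per-feature mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
# Reuse the training statistics for the test rows
X_te_s = scaler.transform(X_te)

print(X_tr_s.mean(axis=0).round(6))  # training features are centred on zero
print(X_tr_s.std(axis=0).round(6))   # and have unit standard deviation
print(X_te_s.mean(axis=0).round(2))  # test features are only approximately centred
```

Note that the scaled values are not confined to a fixed interval; standardization centres and rescales each feature but does not clip it.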

1. Create X_train, X_test, Y_train, Y_test using ```train_test_split()``` with an 80/20 train/test split. Look in the SKLearn documentation to understand how the function works.
    - Inspect the first few rows of X_train.
    - Run the cell a few times. Do the first few rows change?
    - What option can we use in ```train_test_split()``` to stop this from happening?
2. Normalize the train and test datasets with ```StandardScaler```
    - We can fit the transform with ```.fit()``` and apply it with ```.transform()```. Look in the documentation for an example of how to do this.
    - Does it make sense to normalize Y_train and Y_test?
3. Initialize a ```LogisticRegression()``` model and use the ```.fit()``` method to train the first model.
    - We will pass the X_train and Y_train variables to the ```.fit()``` method.
    - Once the model is fit, use the ```.predict()``` with the X_test and save the output as predictions.
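To illustrate the reproducibility point in step 1, the sketch below runs the same 80/20 split twice with a fixed `random_state` and checks that both runs select identical rows. The seed value 42 and the names `split_a` and `split_b` are arbitrary choices for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X_all, y_all = datasets.load_iris(return_X_y=True)

# The same random_state gives the same shuffle, so the two splits match row for row
split_a = train_test_split(X_all, y_all, test_size=0.2, random_state=42)
split_b = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# train_test_split returns X_train, X_test, y_train, y_test in that order
print(len(split_a[0]), len(split_a[1]))        # sizes of the train and test features
print(np.array_equal(split_a[0], split_b[0]))  # whether both training sets are identical
```

Without `random_state`, each call draws a fresh shuffle, which is why the first rows change from run to run.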

In [177]:
#split train and test data 80/20
#your code here
from sklearn.model_selection import train_test_split, cross_validate

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5)
print(len(X), len(y), len(X_train), len(y_train))
X_train[:10]

150 150 75 75


array([[6.7, 3.3, 5.7, 2.1],
       [5.9, 3.2, 4.8, 1.8],
       [6.7, 3.1, 5.6, 2.4],
       [5.1, 3.4, 1.5, 0.2],
       [5. , 3.3, 1.4, 0.2],
       [5.1, 3.5, 1.4, 0.2],
       [7.7, 3.8, 6.7, 2.2],
       [5.7, 2.8, 4.1, 1.3],
       [7.7, 2.6, 6.9, 2.3],
       [5.8, 2.7, 5.1, 1.9]])

In [178]:
#normalize the dataset
#your code here
scaler = preprocessing.StandardScaler()
        
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

In [179]:
#initialize and fit with Logistic Regression
#your code here
clf = LogisticRegression().fit(X_train, y_train)
print(y_test)
clf.predict(X_test)

['versicolor' 'versicolor' 'virginica' 'setosa' 'virginica' 'virginica'
 'virginica' 'versicolor' 'setosa' 'versicolor' 'versicolor' 'setosa'
 'versicolor' 'versicolor' 'versicolor' 'virginica' 'virginica'
 'virginica' 'setosa' 'setosa' 'setosa' 'versicolor' 'virginica' 'setosa'
 'virginica' 'versicolor' 'versicolor' 'versicolor' 'setosa' 'virginica'
 'versicolor' 'setosa' 'versicolor' 'versicolor' 'versicolor' 'virginica'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'versicolor' 'setosa'
 'versicolor' 'versicolor' 'setosa' 'virginica' 'virginica' 'setosa'
 'virginica' 'versicolor' 'setosa' 'versicolor' 'virginica' 'virginica'
 'virginica' 'versicolor' 'setosa' 'versicolor' 'virginica' 'versicolor'
 'virginica' 'setosa' 'setosa' 'versicolor' 'virginica' 'setosa'
 'virginica' 'virginica' 'setosa' 'setosa' 'virginica' 'versicolor'
 'setosa' 'setosa']


array(['versicolor', 'versicolor', 'virginica', 'setosa', 'virginica',
       'virginica', 'virginica', 'versicolor', 'setosa', 'versicolor',
       'versicolor', 'setosa', 'versicolor', 'versicolor', 'versicolor',
       'virginica', 'virginica', 'virginica', 'setosa', 'setosa',
       'setosa', 'virginica', 'virginica', 'setosa', 'virginica',
       'versicolor', 'virginica', 'versicolor', 'setosa', 'virginica',
       'virginica', 'setosa', 'versicolor', 'versicolor', 'versicolor',
       'virginica', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'versicolor', 'setosa', 'versicolor', 'versicolor', 'setosa',
       'virginica', 'virginica', 'setosa', 'virginica', 'versicolor',
       'setosa', 'versicolor', 'virginica', 'virginica', 'virginica',
       'versicolor', 'setosa', 'versicolor', 'virginica', 'versicolor',
       'virginica', 'setosa', 'setosa', 'versicolor', 'virginica',
       'setosa', 'virginica', 'virginica', 'setosa', 'setosa',
       'virginica', 'versicol

## Step 4: Evaluate the first model's predictions

We will learn more about how to evaluate the performance of a classifier in later lessons. For now we will use accuracy (the percentage of correct predictions) as our metric. It is important to know that this metric only tells us how well the model performs overall and not, for example, where we can improve it or where it already performs well.

1. Use ```.score()``` to evaluate the performance of our first model.

In [180]:
#evaluating the performance of our first model
#your code here

clf.score(X_test, y_test)

0.96
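For a classifier, `.score()` returns the mean accuracy, i.e. the fraction of test samples whose predicted label equals the true label. The sketch below recomputes that number by hand on a separately trained model. The split settings, `max_iter=1000` and the variable names are assumptions for this illustration, not the values used in the cells above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_all, y_all = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=0)

# max_iter is raised so the solver converges on the unscaled features
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Fraction of predictions that match the true labels
manual_accuracy = np.mean(pred == y_te)
print(manual_accuracy, model.score(X_te, y_te))
```

Both printed numbers should agree, which shows that a single accuracy figure says nothing about which classes the errors fall in.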

## Step 5: Question your results. 
What accuracy did you achieve? Is it 70%? 90%? Anything above 70% is a good fit for our first result. But how do we know it is reproducible? **If we run the model again and our performance is 85%, which one is correct**? And what about improving our model?
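One way to approach the reproducibility question is cross-validation, which scores the model on several different train/test partitions instead of one. The following is a minimal sketch, assuming 5 folds and a scaler-plus-classifier pipeline; the names `pipe` and `scores` are illustrative.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = datasets.load_iris(return_X_y=True)

# The pipeline refits the scaler inside each fold, using that fold's training rows only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_all, y_all, cv=5)

print(scores)                        # one accuracy value per fold
print(scores.mean(), scores.std())  # average accuracy and its spread across folds
```

The spread across folds gives a rough sense of how much a single train/test accuracy can vary from one split to another.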

## However ...
There is one crucial mistake that has been made in the exercise above (even if we achieved great results). Can you spot it? You can go back to the lecture slides for inspiration.

So, at first I set `shuffle=False` in `train_test_split()` and got 0.03 accuracy. After enabling shuffling, accuracy is nearly 100%.
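A likely explanation for the observation above is that the Iris rows are stored sorted by class, so an unshuffled 50/50 split leaves one class out of the training half entirely. The sketch below counts the classes on each side of such a split; the variable names are chosen for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X_all, y_all = datasets.load_iris(return_X_y=True)

# Without shuffling, the first 75 rows become the training set and the last 75 the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.5, shuffle=False)

# Count how many samples of classes 0, 1 and 2 land in each half
train_counts = np.bincount(y_tr, minlength=3)
test_counts = np.bincount(y_te, minlength=3)
print(train_counts)  # 50 setosa, 25 versicolor, 0 virginica
print(test_counts)   # 0 setosa, 25 versicolor, 50 virginica
```

A model that never sees virginica during training cannot predict it, and two thirds of the test half is virginica, so very low accuracy is expected in that setting.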

## Optional:
Repeat the cells you need to change in the exercise and run the classifier again. What is the new accuracy and why is this better?

In [1]:
#your code here
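As one possible shape for the rerun, and not necessarily the intended solution, the sketch below combines a seeded, stratified 80/20 split with a pipeline that scales and classifies in one object. The seed, `stratify`, `max_iter=1000` and all variable names are assumptions made for this example.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = datasets.load_iris(return_X_y=True)

# Seeded 80/20 split; stratify keeps the three classes balanced in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0, stratify=y_all
)

# fit() learns the scaling from X_tr only; score() reuses it when transforming X_te
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

Bundling the scaler and the model means the test set is always transformed with statistics learned from the training set, without having to manage the scaler by hand.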