# Iris Prediction 

## Asking the Right Question
Use the Machine Learning Workflow to process and transform the Iris data to create a prediction model.
The model must predict the species of an iris flower with 70% or greater accuracy.

### Import the libraries

In [99]:
import matplotlib.pyplot as plt

# Do plotting inline instead of in a separate window
%matplotlib inline

### Load Iris Data

In [100]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()
X_iris = iris_dataset.data  # X is uppercase because it is a matrix
y_iris = iris_dataset.target # y is lowercase because it is a vector

print("Iris FEATURES shape")
print(X_iris.shape)
print("\nIris TARGET shape")
print(y_iris.shape)


Iris FEATURES shape
(150, 4)

Iris TARGET shape
(150,)


In [101]:
print("Feature names: \n{0}\n".format(iris_dataset.feature_names))
print("First 5 rows of the feature data")
print(X_iris[:5])

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

First 5 rows of the feature data
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]


Each row is one observation (a sample); each column is one feature.

In [102]:
print("First 5 values of the target data")
print(y_iris[:5])

First 5 values of the target data
[0 0 0 0 0]
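
The integer targets are indices into `iris_dataset.target_names`. The following is a minimal sketch of that mapping; it reloads the dataset so it runs on its own, and `label_to_species` and `first_five` are illustrative names that do not appear elsewhere in this notebook.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# target_names[i] is the species name for integer label i
label_to_species = {i: str(name) for i, name in enumerate(iris_dataset.target_names)}
print(label_to_species)

# Translate the first five integer targets into species names
first_five = [label_to_species[int(label)] for label in iris_dataset.target[:5]]
print(first_five)
```

This is why the label `0` is read as setosa, `1` as versicolor, and `2` as virginica in the cells that follow.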


### Check the setosa/versicolor/virginica ratio

In [103]:
# Count the samples of each class (0 = setosa, 1 = versicolor, 2 = virginica)
num_sectosa = len(list(filter(lambda x: x == 0, y_iris)))
num_versicolor = len(list(filter(lambda x: x == 1, y_iris)))
num_virginica = len(list(filter(lambda x: x == 2, y_iris)))

# Each percentage uses the count of its own class
print("Number of setosa: {0} ({1:2.2f}%)".format(num_sectosa, num_sectosa/len(y_iris) * 100))
print("Number of versicolor: {0} ({1:2.2f}%)".format(num_versicolor, num_versicolor/len(y_iris) * 100))
print("Number of virginica: {0} ({1:2.2f}%)".format(num_virginica, num_virginica/len(y_iris) * 100))

Number of setosa: 50 (33.33%)
Number of versicolor: 50 (33.33%)
Number of virginica: 50 (33.33%)
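
The same class counts can be obtained without one `filter` call per class. The sketch below assumes NumPy is available (scikit-learn depends on it) and uses `np.bincount`; `class_counts` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
y_iris = iris_dataset.target

# np.bincount counts how often each non-negative integer label occurs
class_counts = np.bincount(y_iris)

for name, count in zip(iris_dataset.target_names, class_counts):
    print("Number of {0}: {1} ({2:.2f}%)".format(name, count, count / len(y_iris) * 100))
```

Because the count and the percentage come from the same loop variable, a label cannot be paired with another class's count by mistake.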


## Preparing the data (Data is ready to use)

## Selecting the Initial Algorithms

Role of algorithm
    - Use the solution statement to filter algorithms
    - Discuss the best algorithms
    - Select one initial algorithm
        - Learning Type => Prediction Model => Supervised ML
        - Result
            - Regression
            - *Classification
        - Complexity
            - *Simple
            - Ensemble algorithms
        - *Basic vs enhanced
    
There are over 50 algorithms in scikit-learn
    - Supervised ML algorithms: 28 
        - Regression: 8
        - Classification (Maybe Both): 20
            - Basic & Enhanced algorithms: 14
    - Others: 22

Based on the above information, I am going to choose two simple initial algorithms
    1. KNN
    2. Logistic Regression

## Training the Model

### Splitting the data

70% for training, 30% for testing

In [104]:
from sklearn.model_selection import train_test_split

split_test_size = 0.30
# random_state (could be any number) guarantees that the split is the same every time the data is split again
X_train, X_test, y_train, y_test = train_test_split(X_iris, y_iris, test_size = split_test_size, random_state=42)

We check to ensure we have the desired 70% train and 30% test split of the data

In [105]:
print("{0:.2f}% in training set".format(len(X_train)/len(X_iris)*100))
print("{0:.2f}% in test set".format(len(X_test)/len(X_iris)*100))

70.00% in training set
30.00% in test set


### Verifying the target classes were split correctly

In [106]:
print("Original setosa: {0} ({1:2.2f}%)".format(num_sectosa, num_sectosa/len(X_iris) * 100))
print("Original versicolor: {0} ({1:2.2f}%)".format(num_versicolor, num_versicolor/len(X_iris) * 100))
print("Original virginica: {0} ({1:2.2f}%)".format(num_virginica, num_virginica/len(X_iris) * 100))
print("Original total: {0}".format(len(X_iris)))
print("")

num_sectosa_train = len(list(filter(lambda x: x == 0, y_train)))
num_versicolor_train = len(list(filter(lambda x: x == 1, y_train)))
num_virginica_train = len(list(filter(lambda x: x == 2, y_train)))

# Class percentages are relative to the training set; the total is relative to the full dataset
print("Training setosa: {0} ({1:2.2f}%)".format(num_sectosa_train, num_sectosa_train/len(y_train) * 100))
print("Training versicolor: {0} ({1:2.2f}%)".format(num_versicolor_train, num_versicolor_train/len(y_train) * 100))
print("Training virginica: {0} ({1:2.2f}%)".format(num_virginica_train, num_virginica_train/len(y_train) * 100))
print("Training total: {0} ({1:2.2f}%)".format(len(y_train), len(y_train)/len(X_iris) * 100))
print("")

num_sectosa_test = len(list(filter(lambda x: x == 0, y_test)))
num_versicolor_test = len(list(filter(lambda x: x == 1, y_test)))
num_virginica_test = len(list(filter(lambda x: x == 2, y_test)))

# Class percentages are relative to the test set; the total is relative to the full dataset
print("Test setosa: {0} ({1:2.2f}%)".format(num_sectosa_test, num_sectosa_test/len(y_test) * 100))
print("Test versicolor: {0} ({1:2.2f}%)".format(num_versicolor_test, num_versicolor_test/len(y_test) * 100))
print("Test virginica: {0} ({1:2.2f}%)".format(num_virginica_test, num_virginica_test/len(y_test) * 100))
print("Test total: {0} ({1:2.2f}%)".format(len(y_test), len(y_test)/len(X_iris) * 100))

Original setosa: 50 (33.33%)
Original versicolor: 50 (33.33%)
Original virginica: 50 (33.33%)
Original total: 150

Training setosa: 31 (29.52%)
Training versicolor: 37 (35.24%)
Training virginica: 37 (35.24%)
Training total: 105 (70.00%)

Test setosa: 19 (42.22%)
Test versicolor: 13 (28.89%)
Test virginica: 13 (28.89%)
Test total: 45 (30.00%)
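
The random split above leaves the classes unevenly represented (19 setosa but only 13 of each other class in the test set). If equal proportions matter, one option is the `stratify` parameter of `train_test_split`. This is a sketch of an alternative, not the split used in the rest of this notebook; the `_s` suffixed names are illustrative and chosen so the original variables are not overwritten.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# stratify=y_iris asks for the same class proportions in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_iris, y_iris, test_size=0.30, random_state=42, stratify=y_iris
)

train_counts = np.bincount(y_train_s)
test_counts = np.bincount(y_test_s)
print("Training counts per class:", train_counts)
print("Test counts per class:", test_counts)
```

With three classes of 50 samples each, a stratified 70/30 split places 35 samples of every class in the training set and 15 in the test set.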


### Training KNN algorithm

In [111]:
from sklearn.neighbors import KNeighborsClassifier

# n_neighbors is a hyperparameter that controls model complexity: larger values give smoother decision boundaries
knn_model = KNeighborsClassifier(n_neighbors = 5)
knn_model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

#### Performance on training data 

In [108]:
# import the performance metrics library from scikit-learn
from sklearn import metrics

knn_predict_train = knn_model.predict(X_train)

# Accuracy
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_train, knn_predict_train)))

Accuracy: 0.9524


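Note: high training accuracy by itself says little. A model that has learnt the training data too well (overfitting) scores well here and poorly on unseen data, so the test data is the real check.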

#### Performance on test data 

In [109]:
knn_predict_test = knn_model.predict(X_test)

# Accuracy
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, knn_predict_test)))

Accuracy: 1.0000
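
To see how `n_neighbors` affects the gap between training and test accuracy, a small sweep can help. The sketch below rebuilds the same 70/30 split so it runs on its own; the list of `k_values` is an arbitrary illustrative choice, and exact scores may vary with the scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.30, random_state=42
)

# Candidate values of k are an illustrative choice, not taken from the notebook
k_values = [1, 3, 5, 7, 9, 15]
scores = {}
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # score() returns the mean accuracy on the given data
    scores[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print("k={0:2d}  train={1:.4f}  test={2:.4f}".format(k, *scores[k]))
```

A training score far above the test score would point to overfitting; scores that stay close together suggest the chosen `k` generalizes reasonably on this split.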


### Metrics

In [131]:
print("Confusion Matrix")
print(metrics.confusion_matrix(y_test, knn_predict_test, labels=[0,1,2]))

Confusion Matrix
[[19  0  0]
 [ 0 13  0]
 [ 0  0 13]]


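Confusion Matrix

    scikit-learn puts the actual classes in rows and the predicted classes in columns.
    For a binary problem with labels ordered [no, yes], the layout is:

    [[TN, FP]
     [FN, TP]]
     
    - TP: Predicted = yes, Actual = yes
    - FP: Predicted = yes, Actual = no
    - FN: Predicted = no,  Actual = yes
    - TN: Predicted = no,  Actual = no

    For the three-class iris problem, the diagonal holds the correctly classified samples
    and every off-diagonal entry is a misclassification.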
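
A toy check of where each of the four counts lands in the output of scikit-learn's `confusion_matrix`. The labels `y_true` and `y_pred` are invented for illustration and have nothing to do with the iris data.

```python
from sklearn.metrics import confusion_matrix

# Toy binary labels, invented for illustration: 0 = no, 1 = yes
y_true = [0, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# With labels=[0, 1], ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print("TN={0} FP={1} FN={2} TP={3}".format(tn, fp, fn, tp))
```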

In [130]:
print("Classification Report")
print(metrics.classification_report(y_test, knn_predict_test, labels=[0,1,2]))

Classification Report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        19
          1       1.00      1.00      1.00        13
          2       1.00      1.00      1.00        13

avg / total       1.00      1.00      1.00        45



Mathematically
    - recall: TP/(TP+FN)
    - precision: TP/(TP+FP)
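
For a multi-class problem these formulas are applied once per class. As a worked sketch, the matrix below is copied from the logistic regression test results reported in this notebook; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the logistic regression test results in this notebook
# (rows = actual class, columns = predicted class)
cm = np.array([[19, 0, 0],
               [0, 12, 1],
               [0, 0, 13]])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)     # TP / (TP + FN): divide by row totals
precision = true_positives / cm.sum(axis=0)  # TP / (TP + FP): divide by column totals

for label, (p, r) in enumerate(zip(precision, recall)):
    print("class {0}: precision={1:.2f} recall={2:.2f}".format(label, p, r))
```

Class 1 has recall 12/13 because one actual versicolor was predicted as virginica, and class 2 has precision 13/14 because that same sample counts as a false positive for virginica.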

### Training Logistic Regression

In [117]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train,y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=42, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

#### Performance on training data

In [133]:
lr_predict_train = lr_model.predict(X_train)

# Training metrics
# Accuracy
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_train, lr_predict_train)))
print("")
print("Metrics")
print("Confusion matrix")
print(metrics.confusion_matrix(y_train, lr_predict_train, labels=[0,1,2]))
print("")
print("Classification Report")
print(metrics.classification_report(y_train, lr_predict_train, labels=[0,1,2]))

Accuracy: 0.9619

Metrics
Confusion matrix
[[31  0  0]
 [ 0 33  4]
 [ 0  0 37]]

Classification Report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        31
          1       1.00      0.89      0.94        37
          2       0.90      1.00      0.95        37

avg / total       0.97      0.96      0.96       105



#### Performance on test data

In [134]:
lr_predict_test = lr_model.predict(X_test)

# Test metrics
# Accuracy
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, lr_predict_test)))
print("")
print("Metrics")
print("Confusion matrix")
print(metrics.confusion_matrix(y_test, lr_predict_test, labels=[0,1,2]))
print("")
print("Classification Report")
print(metrics.classification_report(y_test, lr_predict_test, labels=[0,1,2]))

Accuracy: 0.9778

Metrics
Confusion matrix
[[19  0  0]
 [ 0 12  1]
 [ 0  0 13]]

Classification Report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        19
          1       1.00      0.92      0.96        13
          2       0.93      1.00      0.96        13

avg / total       0.98      0.98      0.98        45
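
To tie the results back to the 70% goal from the problem statement, both candidates can be scored in one loop. This is a sketch rather than a reproduction of the cells above: `max_iter=1000` is an assumption added to avoid convergence warnings, and newer scikit-learn releases use a different default solver than the `liblinear` shown in the output above, so the logistic regression score may differ slightly from 0.9778.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.30, random_state=42
)

required_accuracy = 0.70  # goal from the problem statement

candidates = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    # max_iter=1000 is an assumption, not a setting used in the original notebook
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
}

test_accuracy = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = accuracy_score(y_test, model.predict(X_test))
    verdict = "meets" if test_accuracy[name] >= required_accuracy else "misses"
    print("{0}: {1:.4f} ({2} the 70% goal)".format(name, test_accuracy[name], verdict))
```

With only 45 test samples, a single misclassification moves the accuracy by about 2 percentage points, so small differences between the two models on this split should not be over-interpreted.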

