# Week 02 Homework: Iris Classification

Homework 02 asks us to improve the Iris classification models:

**Ways to improve on the Iris models**

1. Adjust hyperparameters of models
2. Add features
    - Try Length / Width
    - Use an unsupervised model (K-Means)
3. Add models to set
    - XGBoost
    - LightGBM
4. Add visualization
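The two feature ideas in item 2 could be sketched roughly as follows. This is only an illustration, not part of the homework run: it loads the copy of Iris bundled with scikit-learn instead of the CSV, and the column names `sepal-ratio`, `petal-ratio`, and `cluster` are made up for the example. K-Means is fitted on all rows here for brevity; in a real experiment it would be fitted on the training split only to avoid leaking validation data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Bundled copy of Iris, so the sketch runs without network access.
iris = load_iris()
features = pd.DataFrame(
    iris.data,
    columns=['sepal-length', 'sepal-width', 'petal-length', 'petal-width'],
)

# Length / width ratios (hypothetical column names).
features['sepal-ratio'] = features['sepal-length'] / features['sepal-width']
features['petal-ratio'] = features['petal-length'] / features['petal-width']

# Cluster id from an unsupervised K-Means fit, used as one more feature.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7)
features['cluster'] = kmeans.fit_predict(iris.data)

print(features.head())
```

The enlarged `features` frame can then be passed through the same split and cross-validation steps as the original four columns.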

In [1]:
# import libraries

import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

In [2]:
# ignore warnings

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

## Load dataset

In [3]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(url, names=names)

## Explore Data

In [4]:
# shape
print("Shape: ", dataset.shape, '\n')

# head
print("First records of data:\n", dataset.head(), '\n')

# class distribution
print('Class Distribution:')
print(dataset.groupby('class').size())

Shape:  (150, 5) 

First records of data:
    sepal-length  sepal-width  petal-length  petal-width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa 

Class Distribution:
class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64


In [5]:
print("Statistical description of data:\n",dataset.describe())

Statistical description of data:
        sepal-length  sepal-width  petal-length  petal-width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
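`scatter_matrix` is imported above but never called. A minimal sketch of how it might be used for item 4 (visualization) is shown here, assuming the bundled scikit-learn copy of Iris rather than the CSV and a non-interactive matplotlib backend so it also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; a notebook renders inline instead
import pandas as pd
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris()
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']
dataset = pd.DataFrame(iris.data, columns=names)

# Pairwise scatter plots, with each point coloured by its class index (0, 1, 2)
# and a histogram of each feature on the diagonal.
axes = scatter_matrix(dataset, c=iris.target, figsize=(8, 8), diagonal='hist')
fig = axes[0, 0].get_figure()
```

The petal columns usually separate the classes most clearly in such a plot, which is one way to judge whether ratio features are likely to help.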


---

## Split out validation dataset

In [6]:
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7

X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X,
    Y,
    test_size=validation_size,
    random_state=seed
)
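With a plain random split, the validation set in this notebook ends up with 7, 12, and 11 rows per class (see the `support` column in the reports further down). One possible variation, not used in the original run, is to pass `stratify` so that every class is equally represented. The sketch assumes the bundled scikit-learn copy of Iris, whose labels are already integers.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, Y = load_iris(return_X_y=True)

# stratify=Y keeps the class proportions identical in both splits.
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.20, random_state=7, stratify=Y
)

# Number of validation rows per class.
print(np.bincount(Y_validation))
```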

## Create Model Shells

In [7]:
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RF', RandomForestClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
models.append(('GB', GradientBoostingClassifier()))

models

[('LR',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
            intercept_scaling=1, max_iter=100, multi_class='warn',
            n_jobs=None, penalty='l2', random_state=None, solver='warn',
            tol=0.0001, verbose=0, warm_start=False)),
 ('LDA',
  LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                solver='svd', store_covariance=False, tol=0.0001)),
 ('KNN',
  KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
             metric_params=None, n_jobs=None, n_neighbors=5, p=2,
             weights='uniform')),
 ('CART',
  DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, presort=False, random_state=None,
              splitter='best')),
 ('RF'

## Spot test each model with Cross-Validation

In [8]:
# Test options and evaluation metric
seed = 7
scoring = 'accuracy'

results = []
names = []

# evaluate each model in turn
for name, model in models:
    # random_state only has an effect when shuffle=True; recent scikit-learn raises an error otherwise
    kfold = model_selection.KFold(n_splits=10)
    cv_results = model_selection.cross_val_score(
        model, X_train, Y_train, cv=kfold, scoring=scoring
    )
    results.append(cv_results)
    names.append(name)
    msg = f"{name}: {cv_results.mean():0.4f} ({cv_results.std():0.4f})"
    print(msg)

LR: 0.9667 (0.0408)
LDA: 0.9750 (0.0382)
KNN: 0.9833 (0.0333)
CART: 0.9750 (0.0382)
RF: 0.9667 (0.0408)
NB: 0.9750 (0.0534)
SVM: 0.9917 (0.0250)
GB: 0.9583 (0.0417)
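The loop above collects per-fold scores in `results` and `names`, which lend themselves to a box plot. The following is a rough sketch of that comparison, assuming the bundled scikit-learn copy of Iris and only two of the models to keep it short. Because that copy is sorted by class, the folds are shuffled here, so the scores will not match the numbers printed above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; a notebook renders inline instead
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, Y = load_iris(return_X_y=True)
models = [('LDA', LinearDiscriminantAnalysis()), ('KNN', KNeighborsClassifier())]

results, names = [], []
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    results.append(cross_val_score(model, X, Y, cv=kfold, scoring='accuracy'))
    names.append(name)

# One box per model, showing the spread of the 10 fold accuracies.
fig, ax = plt.subplots()
ax.boxplot(results)
ax.set_xticks(range(1, len(names) + 1))
ax.set_xticklabels(names)
ax.set_ylabel('accuracy')
ax.set_title('Algorithm comparison')
```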


## Make predictions on validation dataset

K-Nearest Neighbors, using a helper that fits a model and reports its validation metrics:

In [9]:
def make_prediction(model):
    print(model)
    model.fit(X_train, Y_train)
    predictions = model.predict(X_validation)

    print("Accuracy Score:", accuracy_score(Y_validation, predictions))
    print("\nConfusion Matrix:\n", confusion_matrix(Y_validation, predictions))
    print("\nClassification Report:\n", classification_report(Y_validation, predictions))

In [10]:
make_prediction(KNeighborsClassifier())

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')
Accuracy Score: 0.9

Confusion Matrix:
 [[ 7  0  0]
 [ 0 11  1]
 [ 0  2  9]]

Classification Report:
                  precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.85      0.92      0.88        12
 Iris-virginica       0.90      0.82      0.86        11

      micro avg       0.90      0.90      0.90        30
      macro avg       0.92      0.91      0.91        30
   weighted avg       0.90      0.90      0.90        30



In [11]:
for name, model in models:
    print(name)
    make_prediction(model)
    print('\n\n')

LR
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
Accuracy Score: 0.8

Confusion Matrix:
 [[ 7  0  0]
 [ 0  7  5]
 [ 0  1 10]]

Classification Report:
                  precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.88      0.58      0.70        12
 Iris-virginica       0.67      0.91      0.77        11

      micro avg       0.80      0.80      0.80        30
      macro avg       0.85      0.83      0.82        30
   weighted avg       0.83      0.80      0.80        30




LDA
LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
              solver='svd', store_covariance=False, tol=0.0001)
Accuracy Score: 0.9666666666666667

Confusion Matrix:
 [[ 7  0  0]
 [ 0 11  1]
 [ 0

---

## Use K Nearest Neighbor with GridSearch

In [12]:
KNeighborsClassifier()

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [13]:
param_grid = {
    'n_neighbors': np.arange(1, 20)
}

knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, param_grid, cv=10)

knn_cv.fit(X_train, Y_train)



GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [14]:
print('Best Score:', knn_cv.best_score_,'| Best Parameter K: ', knn_cv.best_params_)

Best Score: 0.9916666666666667 | Best Parameter K:  {'n_neighbors': 13}


Let's make a prediction using the model suggested above:

In [15]:
make_prediction(KNeighborsClassifier(n_neighbors=13))

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=13, p=2,
           weights='uniform')
Accuracy Score: 0.9

Confusion Matrix:
 [[ 7  0  0]
 [ 0 10  2]
 [ 0  1 10]]

Classification Report:
                  precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.91      0.83      0.87        12
 Iris-virginica       0.83      0.91      0.87        11

      micro avg       0.90      0.90      0.90        30
      macro avg       0.91      0.91      0.91        30
   weighted avg       0.90      0.90      0.90        30



---

## Use XGBoost

In [16]:
# map the iris class to numerical values
class_map = {
    'Iris-setosa': 0,
    'Iris-versicolor': 1,
    'Iris-virginica': 2
}
Y_new = [class_map[iris_class] for iris_class in Y]

# split data into train and validation sets
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X,
    Y_new,
    test_size=validation_size,
    random_state=seed
)

d_train = xgb.DMatrix(X_train, label=Y_train)
d_validate = xgb.DMatrix(X_validation, label=Y_validation)

# set xgboost params
param = {
    'max_depth': 3,
    'eta': 0.3,
    'silent': 1,
    'objective': 'multi:softprob',
    'num_class': 3
}
num_round = 20

# train model
bst = xgb.train(param, d_train, num_round)

# predict
predictions = bst.predict(d_validate)

# pick best prediction results
best_predictions = np.asarray([np.argmax(line) for line in predictions])

print(accuracy_score(Y_validation, best_predictions))

0.9


**Observations**:

Using XGBoost didn't improve our prediction result: validation accuracy stays at 0.9, the same as KNN.

One way to try to improve on this is GridSearchCV, which performs an exhaustive search over specified parameter values for an estimator. The attempt below did not work well; it is kept for future learning and reference.

In [17]:
# map the iris class to numerical values
class_map = {
    'Iris-setosa': 0,
    'Iris-versicolor': 1,
    'Iris-virginica': 2
}
Y_new = [class_map[iris_class] for iris_class in Y]

# split data into train and validation sets
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X,
    Y_new,
    test_size=validation_size,
    random_state=seed
)

d_train = xgb.DMatrix(X_train, label=Y_train)
d_validate = xgb.DMatrix(X_validation, label=Y_validation)

# instantiate model

# set static model parameters
param = {
    'silent': 1, # verbosity didn't work although it is recommended in the docs as silent is deprecated
    'objective': 'multi:softprob',
    'num_class': 3
}
num_round = 20

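# NOTE: `param` is passed positionally, so the whole dict lands in the first
# argument (max_depth), as the estimator repr printed further down shows
model = xgb.XGBClassifier(param)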

# set param_grid
param_grid = {
    'num_boost_round': [100, 250, 500],
    'eta': [0.05, 0.1, 0.3],
    'max_depth': [6, 9, 12],
    'subsample': [0.9, 1.0],
    'colsample_bytree': [0.9, 1.0],
}

xgb_cv = GridSearchCV(model, param_grid=param_grid, cv=3, verbose=1).fit(X_train, Y_train)

# train model
#bst = xgb.train(param, d_train, num_round)

# predict
#predictions = bst.predict(d_validate)

# pick best prediction results
#best_predictions = np.asarray([np.argmax(line) for line in predictions])

#print(accuracy_score(Y_validation, best_predictions))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 108 candidates, totalling 324 fits


[Parallel(n_jobs=1)]: Done 324 out of 324 | elapsed:    2.4s finished


In [18]:
xgb_cv

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth={'silent': 1, 'objective': 'multi:softprob', 'num_class': 3},
       min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'num_boost_round': [100, 250, 500], 'eta': [0.05, 0.1, 0.3], 'max_depth': [6, 9, 12], 'subsample': [0.9, 1.0], 'colsample_bytree': [0.9, 1.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [19]:
# Best hyperparameter setting
xgb_cv.best_estimator_

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.9, eta=0.05, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=6, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None, num_boost_round=100,
       objective='multi:softprob', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=0.9)

In [20]:
best_model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.9, eta=0.05, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=6, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None, num_boost_round=100,
       objective='multi:softprob', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=0.9)


# train model
best_model.fit(X_train, Y_train)

# predict
predictions = best_model.predict(X_validation)

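# pick best prediction results
# NOTE: XGBClassifier.predict already returns class labels, so np.argmax over
# each scalar label is always 0 and every sample is scored as class 0
best_predictions = np.asarray([np.argmax(line) for line in predictions])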

print(accuracy_score(Y_validation, best_predictions))

0.23333333333333334
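The score of 0.2333 equals 7/30, and the validation split contains 7 `Iris-setosa` rows (class 0), so it looks consistent with every prediction having been turned into class 0. The NumPy-only sketch below illustrates the likely cause: `np.argmax` is the right step for a probability matrix such as `multi:softprob` returns from `bst.predict`, but it collapses an array that already holds class labels. The arrays are made-up values for illustration, not output from the models above.

```python
import numpy as np

# Hypothetical model outputs for three samples, for illustration only.
proba = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.2, 0.7]])  # shape (n_samples, n_classes): per-class probabilities
labels = np.array([0, 1, 2])         # shape (n_samples,): already class labels

# Row-wise argmax turns probabilities into class labels.
from_proba = proba.argmax(axis=1)

# argmax over a single scalar is always 0, so real labels collapse to class 0.
collapsed = np.asarray([np.argmax(line) for line in labels])

print(from_proba)  # [0 1 2]
print(collapsed)   # [0 0 0]
```

With the scikit-learn wrapper, `predict` returns labels and `predict_proba` returns the probabilities, so passing `predictions` straight to `accuracy_score` would be the natural fix to try.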


---

# Resources

- [Simple XGBoost Tutorial](https://www.kdnuggets.com/2017/03/simple-xgboost-tutorial-iris-dataset.html)
- [GridSearch XGBoost with Scikit-Learn](https://www.kaggle.com/tanitter/grid-search-xgboost-with-scikit-learn)
- [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html)
- [GridSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
- [Scikit-Learn Classifiers on Iris Dataset](https://www.kaggle.com/chungyehwang/scikit-learn-classifiers-on-iris-dataset)
- [Iris Dataset EDA](https://www.kaggle.com/lalitharajesh/iris-dataset-exploratory-data-analysis)
- [XGBoost with Scikit-Learn](https://www.kaggle.com/stuarthallows/using-xgboost-with-scikit-learn)
- [Using clustering for feature engineering on the Iris Dataset](https://www.kaggle.com/stuarthallows/using-xgboost-with-scikit-learn)
- [Iris Dataset Kmeans](https://nbviewer.jupyter.org/github/arunabh15091989/Iris-Kmeans/blob/master/KMeans_Iris.ipynb)
