# Classification

In machine learning, classification is used when you have labelled data and your goal is to predict a categorical output from input data.

This notebook covers the basic pattern for creating a map of hyper-parameters and using Grid Search CV to do a rudimentary comparison of models' performance.

## Imports

Imports all of the libraries for the notebook.

In [1]:
%matplotlib inline
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import graphviz

from sklearn.datasets import load_iris, load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

import seaborn as sns

## Read in Data

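Two datasets are read in for the purpose of demonstrating classification algorithms: wine and iris. Both cells assign to the same `X_train`, `X_test`, `y_train` and `y_test` variables, so the models below are fit on whichever split was created most recently.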

In [2]:
data_wine = load_wine()
print("Feature names:", data_wine.feature_names)
print("Target names:", data_wine.target_names)
print("Shape:", data_wine.data.shape)

X_train, X_test, y_train, y_test = train_test_split(data_wine.data, data_wine.target, test_size=0.20)

Feature names: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Target names: ['class_0' 'class_1' 'class_2']
Shape: (178, 13)


In [43]:
data_iris = load_iris()
print("Feature names:", data_iris.feature_names)
print("Target names:", data_iris.target_names)
print("Shape:", data_iris.data.shape)

X_train, X_test, y_train, y_test = train_test_split(data_iris.data, data_iris.target, test_size=0.20)

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
Shape: (150, 4)
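
The splits above are random and share variable names, so results can change between runs and one dataset can silently replace the other. A minimal sketch of a reproducible alternative follows. The helper `make_split`, the seed value and the `Xw_`/`Xi_` variable prefixes are illustrative assumptions, not part of the original notebook.

```python
from sklearn.datasets import load_iris, load_wine
from sklearn.model_selection import train_test_split


def make_split(dataset, test_size=0.20, seed=0):
    """Return a reproducible, class-balanced train/test split (hypothetical helper)."""
    return train_test_split(
        dataset.data,
        dataset.target,
        test_size=test_size,
        random_state=seed,        # fixes the shuffle so reruns select the same rows
        stratify=dataset.target,  # keeps class proportions similar in both parts
    )


# Separate variable names so one split does not overwrite the other.
Xw_train, Xw_test, yw_train, yw_test = make_split(load_wine())
Xi_train, Xi_test, yi_train, yi_test = make_split(load_iris())

print("wine:", Xw_train.shape, Xw_test.shape)
print("iris:", Xi_train.shape, Xi_test.shape)
```

Stratifying is optional here, but with only 150 to 178 rows it helps keep every class represented in the test portion.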


## Building Models

Using grid search with scikit-learn typically involves:

1. Defining a model instance
1. Defining a map of parameters for the grid search to search over
1. Defining a `GridSearchCV` object, setting parameters for things such as cross-validation and parallelization
1. Executing the grid search on the training data
1. Examining the results of the grid search
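
The five steps can be sketched in one self-contained cell. This is only an illustration of the pattern: the choice of a decision tree, the grid values and the `random_state` settings are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# 1. Define a model instance.
model = DecisionTreeClassifier(random_state=0)

# 2. Define the map of parameters to search over (3 x 2 = 6 combinations).
param_grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 5]}

# 3. Define the search object: 5-fold cross-validation, one worker.
#    Setting n_jobs=-1 instead would use all available cores.
search = GridSearchCV(model, param_grid, cv=5, n_jobs=1)

# 4. Execute the search on the training data only.
search.fit(X_tr, y_tr)

# 5. Examine the results, best combination first.
results = pd.DataFrame(search.cv_results_)
print(results[["params", "mean_test_score"]]
      .sort_values(by="mean_test_score", ascending=False))
print("Best parameters:", search.best_params_)
```

Note that `mean_test_score` is the mean score over the held-out cross-validation folds, not a score on `X_te`.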

### Logistic Regression


In [3]:
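# Note: 'l1' needs a solver that supports it. Since scikit-learn 0.22 the
# default solver is 'lbfgs', which does not, so on newer versions pass a
# compatible solver such as solver='saga'.
lr = LogisticRegression()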
parameters = {'fit_intercept': (True, False), 'penalty':('l1', 'l2')}
clf = GridSearchCV(lr, parameters, cv=5, return_train_score=True)
clf.fit(X_train, y_train)
cv_results = pd.DataFrame(clf.cv_results_)
print(cv_results[['params', 'mean_fit_time', 'mean_score_time', 'mean_test_score']].sort_values(by='mean_test_score', ascending=False))

                                      params  mean_fit_time  mean_score_time  \
0   {'fit_intercept': True, 'penalty': 'l1'}       0.008978         0.003719   
1   {'fit_intercept': True, 'penalty': 'l2'}       0.002741         0.000593   
2  {'fit_intercept': False, 'penalty': 'l1'}       0.006629         0.000202   
3  {'fit_intercept': False, 'penalty': 'l2'}       0.001562         0.000195   

   mean_test_score  
0         0.950704  
1         0.950704  
2         0.950704  
3         0.950704  
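
The table reports cross-validated scores computed on the training data, and all four combinations tie. The held-out test set is never used in the cell above, so a natural follow-up is to score the selected model on it. The sketch below assumes the wine data, searches over `C` rather than `penalty` so that it runs with the default solver, and raises `max_iter` to help the solver converge. These choices are illustrative, not the notebook's.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

lr = LogisticRegression(max_iter=5000)
parameters = {"fit_intercept": (True, False), "C": [0.1, 1.0, 10.0]}
clf = GridSearchCV(lr, parameters, cv=5)
clf.fit(X_train, y_train)

# Winner according to cross-validation on the training data.
print("Best parameters:", clf.best_params_)
print("Best CV accuracy:", clf.best_score_)

# GridSearchCV refits the best combination on all training rows by default,
# so score() evaluates that refitted model on data it has never seen.
test_accuracy = clf.score(X_test, y_test)
print("Held-out test accuracy:", test_accuracy)
```

With a test set this small (36 rows), a single misclassification moves the accuracy by almost three percentage points, so small differences between models should be read with caution.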


### Decision Trees

In [4]:
dt = DecisionTreeClassifier()
parameters = {'criterion': ('gini', 'entropy'), 'min_samples_leaf':[1, 3, 5]}
clf = GridSearchCV(dt, parameters, cv=5, return_train_score=True)
clf.fit(X_train, y_train)
cv_results = pd.DataFrame(clf.cv_results_)
print(cv_results[['params', 'mean_fit_time', 'mean_score_time', 'mean_test_score']].sort_values(by='mean_test_score', ascending=False))

                                            params  mean_fit_time  \
3  {'criterion': 'entropy', 'min_samples_leaf': 1}       0.000967   
4  {'criterion': 'entropy', 'min_samples_leaf': 3}       0.001367   
2     {'criterion': 'gini', 'min_samples_leaf': 5}       0.000390   
0     {'criterion': 'gini', 'min_samples_leaf': 1}       0.001562   
1     {'criterion': 'gini', 'min_samples_leaf': 3}       0.000783   
5  {'criterion': 'entropy', 'min_samples_leaf': 5}       0.000586   

   mean_score_time  mean_test_score  
3         0.000195         0.894366  
4         0.000000         0.894366  
2         0.000195         0.887324  
0         0.000586         0.880282  
1         0.000195         0.880282  
5         0.000195         0.880282  
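
One advantage of a decision tree is that the fitted model can be read directly. The sketch below prints the learned rules as text with `sklearn.tree.export_text`, which avoids the Graphviz dependency. The `max_depth=3` limit and the hyper-parameter values are assumptions chosen to keep the printout short.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_wine()

# A shallow tree, limited to depth 3 so the rules stay readable.
tree = DecisionTreeClassifier(
    criterion="entropy", min_samples_leaf=3, max_depth=3, random_state=0
)
tree.fit(data.data, data.target)

# Render the split rules as indented text, using the real feature names.
rules = export_text(tree, feature_names=list(data.feature_names))
print(rules)
```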


### kNN

In [5]:
nbrs = KNeighborsClassifier()
parameters = {'n_neighbors':[1, 3, 5, 10], 'weights': ('uniform', 'distance')}
clf = GridSearchCV(nbrs, parameters, cv=5, return_train_score=True)
clf.fit(X_train, y_train)
cv_results = pd.DataFrame(clf.cv_results_)
print(cv_results[['params', 'mean_fit_time', 'mean_score_time', 'mean_test_score']].sort_values(by='mean_test_score', ascending=False))

                                       params  mean_fit_time  mean_score_time  \
2    {'n_neighbors': 3, 'weights': 'uniform'}       0.000584         0.000781   
6   {'n_neighbors': 10, 'weights': 'uniform'}       0.000195         0.000772   
0    {'n_neighbors': 1, 'weights': 'uniform'}       0.000781         0.001367   
1   {'n_neighbors': 1, 'weights': 'distance'}       0.000388         0.000772   
3   {'n_neighbors': 3, 'weights': 'distance'}       0.000585         0.000976   
7  {'n_neighbors': 10, 'weights': 'distance'}       0.000391         0.000585   
4    {'n_neighbors': 5, 'weights': 'uniform'}       0.000780         0.000195   
5   {'n_neighbors': 5, 'weights': 'distance'}       0.000000         0.000979   

   mean_test_score  
2         0.725352  
6         0.725352  
0         0.711268  
1         0.711268  
3         0.711268  
7         0.711268  
4         0.690141  
5         0.669014  
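
The kNN scores are noticeably lower than those of the other models. One plausible reason is that kNN relies on distances and the features are on very different scales. The sketch below compares the same grid with and without standardization, assuming the wine data. The pipeline step names `scale` and `knn` are arbitrary labels chosen for the example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# Baseline: kNN on the raw features, same grid as the cell above.
raw_grid = {"n_neighbors": [1, 3, 5, 10], "weights": ("uniform", "distance")}
raw_search = GridSearchCV(KNeighborsClassifier(), raw_grid, cv=5)
raw_search.fit(X_train, y_train)

# Same grid, but every feature is standardized first. Inside a Pipeline the
# scaler is fit on each training fold only, so no validation data leaks in.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
scaled_grid = {"knn__" + name: values for name, values in raw_grid.items()}
scaled_search = GridSearchCV(pipe, scaled_grid, cv=5)
scaled_search.fit(X_train, y_train)

print("Best CV accuracy, raw features:   ", raw_search.best_score_)
print("Best CV accuracy, scaled features:", scaled_search.best_score_)
```

Parameters of a pipeline step are addressed as `<step name>__<parameter>`, the same double-underscore convention used for nested estimators elsewhere in scikit-learn.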


### Neural Nets

In [15]:
mlp = MLPClassifier(solver='sgd', learning_rate='constant')
parameters = {'hidden_layer_sizes':[(50), (100), (150), (100, 50)], 'learning_rate_init': [0.0005, 0.001, 0.005, 0.01]}
clf = GridSearchCV(mlp, parameters, cv=5, return_train_score=True)
clf.fit(X_train, y_train)
cv_results = pd.DataFrame(clf.cv_results_)
# print(cv_results)
print(cv_results[['param_hidden_layer_sizes', 'param_learning_rate_init', 'mean_fit_time', 'mean_score_time', 'mean_test_score']].sort_values(by='mean_test_score', ascending=False).head(10))

   param_hidden_layer_sizes param_learning_rate_init  mean_fit_time  \
4                       100                   0.0005       0.019715   
8                       150                   0.0005       0.007036   
0                        50                   0.0005       0.028889   
3                        50                     0.01       0.003513   
12                (100, 50)                   0.0005       0.007426   
10                      150                    0.005       0.016594   
2                        50                    0.005       0.004081   
14                (100, 50)                    0.005       0.008197   
1                        50                    0.001       0.023618   
6                       100                    0.005       0.005261   

    mean_score_time  mean_test_score  
4          0.000195         0.387324  
8          0.000392         0.387324  
0          0.000000         0.373239  
3          0.000195         0.373239  
12         0.000391    

### Ensembles

#### Random Forest

In [16]:
rf = RandomForestClassifier()
parameters = {'max_depth': [1, 3, 5, 10],
              'min_samples_leaf': [1, 3, 5, 10, 20],
              'n_estimators': [5, 10, 20, 50]}
clf = GridSearchCV(rf, parameters, cv=5, return_train_score=True)
clf.fit(X_train, y_train)
cv_results = pd.DataFrame(clf.cv_results_)
print(cv_results[['params', 'param_n_estimators', 'mean_fit_time', 'mean_score_time', 'mean_test_score']].sort_values(by='mean_test_score', ascending=False).head(10))

                                               params param_n_estimators  \
23  {'max_depth': 3, 'min_samples_leaf': 1, 'n_est...                 50   
61  {'max_depth': 10, 'min_samples_leaf': 1, 'n_es...                 10   
47  {'max_depth': 5, 'min_samples_leaf': 3, 'n_est...                 50   
51  {'max_depth': 5, 'min_samples_leaf': 5, 'n_est...                 50   
26  {'max_depth': 3, 'min_samples_leaf': 3, 'n_est...                 20   
25  {'max_depth': 3, 'min_samples_leaf': 3, 'n_est...                 10   
22  {'max_depth': 3, 'min_samples_leaf': 1, 'n_est...                 20   
58  {'max_depth': 5, 'min_samples_leaf': 20, 'n_es...                 20   
63  {'max_depth': 10, 'min_samples_leaf': 1, 'n_es...                 50   
39  {'max_depth': 3, 'min_samples_leaf': 20, 'n_es...                 50   

    mean_fit_time  mean_score_time  mean_test_score  
23       0.056811         0.002928         0.978873  
61       0.009954         0.000976         0.978873  
4

#### AdaBoost

In [17]:
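# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator` (the old name
# was removed in 1.4). On newer versions use estimator=... and
# 'estimator__...' keys in the parameter grid.
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())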
parameters = {'base_estimator__criterion':['gini', 'entropy'], 
              'base_estimator__min_samples_leaf': [1, 3, 5, 10, 20],
              'n_estimators': [5, 10, 20, 50]}
clf = GridSearchCV(ada, parameters, cv=5, return_train_score=True)
clf.fit(X_train, y_train)
cv_results = pd.DataFrame(clf.cv_results_)
print(cv_results[['params', 'param_base_estimator__min_samples_leaf', 'param_n_estimators', 'mean_fit_time', 'mean_score_time', 'mean_test_score']].sort_values(by='mean_test_score', ascending=False).head(10))

                                               params  \
15  {'base_estimator__criterion': 'gini', 'base_es...   
9   {'base_estimator__criterion': 'gini', 'base_es...   
8   {'base_estimator__criterion': 'gini', 'base_es...   
26  {'base_estimator__criterion': 'entropy', 'base...   
17  {'base_estimator__criterion': 'gini', 'base_es...   
13  {'base_estimator__criterion': 'gini', 'base_es...   
39  {'base_estimator__criterion': 'entropy', 'base...   
6   {'base_estimator__criterion': 'gini', 'base_es...   
18  {'base_estimator__criterion': 'gini', 'base_es...   
10  {'base_estimator__criterion': 'gini', 'base_es...   

   param_base_estimator__min_samples_leaf param_n_estimators  mean_fit_time  \
15                                     10                 50       0.087441   
9                                       5                 10       0.015804   
8                                       5                  5       0.007019   
26                                      3               

#### Gradient Boosting

In [9]:
gbm = GradientBoostingClassifier()
parameters = {'min_samples_leaf': [1, 3, 5, 10, 20],
              'n_estimators': [5, 10, 20, 50, 100]}
clf = GridSearchCV(gbm, parameters, cv=5, return_train_score=True)
clf.fit(X_train, y_train)
cv_results = pd.DataFrame(clf.cv_results_)
print(cv_results[['params', 'mean_fit_time', 'mean_score_time', 'mean_test_score']].sort_values(by='mean_test_score', ascending=False).head(10))

                                           params  mean_fit_time  \
24  {'min_samples_leaf': 20, 'n_estimators': 100}       0.171971   
19  {'min_samples_leaf': 10, 'n_estimators': 100}       0.206127   
23   {'min_samples_leaf': 20, 'n_estimators': 50}       0.086492   
18   {'min_samples_leaf': 10, 'n_estimators': 50}       0.114584   
17   {'min_samples_leaf': 10, 'n_estimators': 20}       0.038836   
16   {'min_samples_leaf': 10, 'n_estimators': 10}       0.015811   
22   {'min_samples_leaf': 20, 'n_estimators': 20}       0.029265   
13    {'min_samples_leaf': 5, 'n_estimators': 50}       0.083929   
15    {'min_samples_leaf': 10, 'n_estimators': 5}       0.008003   
9    {'min_samples_leaf': 3, 'n_estimators': 100}       0.160648   

    mean_score_time  mean_test_score  
24         0.000585         0.978873  
19         0.000978         0.971831  
23         0.000958         0.964789  
18         0.000585         0.964789  
17         0.000390         0.964789  
16         0.0003
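
Each search above is run and printed separately, which makes a side-by-side comparison awkward. A rough sketch of collecting several searches into one summary table follows. The `candidates` dictionary, its reduced grids and the fixed `random_state` values are assumptions made to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# name -> (model instance, parameter grid); grids kept small on purpose.
candidates = {
    "decision_tree": (
        DecisionTreeClassifier(random_state=0),
        {"min_samples_leaf": [1, 3, 5]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [10, 50], "max_depth": [3, 5]},
    ),
    "gradient_boosting": (
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": [20, 50], "min_samples_leaf": [5, 10]},
    ),
}

rows = []
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5)
    search.fit(X_train, y_train)
    rows.append({
        "model": name,
        "best_params": search.best_params_,
        "cv_accuracy": search.best_score_,               # mean over CV folds
        "test_accuracy": search.score(X_test, y_test),   # held-out rows
    })

summary = pd.DataFrame(rows).sort_values(by="cv_accuracy", ascending=False)
print(summary.to_string(index=False))
```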



Rather than studying the intricacies of a particular algorithm, creating a good model can, to a certain extent, be simplified to trying various learning algorithms and combinations of the hyper-parameters that configure them. Care is still needed for things such as understanding whether the input data is actually predictive of the output and whether the learning algorithm 