#  Train and Optimize a Support Vector Machine and a Naive Bayes Classifier using Scikit-learn


Support Vector Machine and Naive Bayes classifier example with GridSearchCV hyperparameter optimization using scikit-learn pipelines

&nbsp;

### Purpose
This example should build the reader's familiarity with training and tuning two important classes of classifier models.  A grid search tunes each model over every combination of the hyperparameters passed to `GridSearchCV`.  Cross-validation is then used to measure the efficacy of each hyperparameter combination and choose the variant with the highest score as the most optimized.  These tasks are coordinated through **scikit-learn**'s `make_pipeline` function so that the code is intuitive to review.

### Data Description
For the sake of fast-tracking through the data hygiene phases, the data used in this example is the iris flower dataset that ships with **scikit-learn**.  It's boring, but it is useful for explanation purposes.

### Classification Goal
The goal with this data is to accurately classify each flower of the *iris* genus as one of three species:

* <span style="color:#1B5E20;">setosa</span>
* <span style="color:#1B5E20;">versicolor</span>
* <span style="color:#1B5E20;">virginica</span>

Classification is based on the following features of each flower instance:

* <span style="color:#E65100">sepal length</span>
* <span style="color:#E65100">sepal width</span>
* <span style="color:#E65100">petal length</span>
* <span style="color:#E65100">petal width</span>

### Dependencies
The following Python packages are required for this script.

 - scikit-learn 
 - pandas

&nbsp;


## Step 1
**Load and Learn the Data**

In [1]:
# Import libraries for loading the data
from sklearn import datasets
import pandas as pd

In [2]:
# Define the data variable as 'flowers'
flowers = datasets.load_iris()

In [3]:
# View the data matrix features and a sample of its instances
# The instances of this matrix will serve as the independent variables
df = pd.DataFrame(data=flowers.data, columns=flowers.feature_names)
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [4]:
# View a description of the dependent variables
df_target = pd.DataFrame(data=flowers.target, columns=['Iris Class'])
df_target.describe()

| | Iris Class |
|---|---|
| count | 150.0 |
| mean | 1.0 |
| std | 0.819232 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 1.0 |
| 75% | 2.0 |
| max | 2.0 |


In [5]:
# Corresponding iris class names mapped as nominal integer data: 0, 1, or 2
print(flowers.target_names)

['setosa' 'versicolor' 'virginica']
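
To make the integer-to-name mapping concrete, here is a minimal sketch that translates every target value into its species name and counts the instances per species. The variable names `species` and `class_counts` are illustrative and do not appear elsewhere in this example.

```python
from sklearn import datasets
import pandas as pd

flowers = datasets.load_iris()

# target_names is a NumPy array, so indexing it with the integer targets
# translates every 0/1/2 label into the matching species name
species = pd.Series(flowers.target_names[flowers.target], name="species")

# Count how many instances belong to each species
class_counts = species.value_counts()
print(class_counts)
```

The iris data is perfectly balanced, which is one reason it is so forgiving as a teaching data set.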


&nbsp;

&nbsp;

## Step 2
**Split into Train & Test Subsets**



The beautiful developers of **scikit-learn** have implemented a function that makes this step a breeze.

In [6]:
from sklearn.model_selection import train_test_split

# Define independent and dependent variables
X = flowers.data
y = flowers.target

# Split the data into train and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69)
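
As a quick sanity check, the sketch below repeats the split and prints the resulting shapes. It also shows the optional `stratify` argument of `train_test_split`, which is not used in this example but keeps the class proportions identical in both subsets. The name `y_test_strat` is made up for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

flowers = datasets.load_iris()
X, y = flowers.data, flowers.target

# Same call as in the cell above: 20% of the 150 instances are held out
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=69)
print(X_train.shape, X_test.shape)

# Optional variation (an assumption, not part of this example):
# stratify=y preserves the class proportions of y in each subset
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=69, stratify=y)

# Class counts per subset, in the order setosa, versicolor, virginica
print(np.bincount(y_test, minlength=3))
print(np.bincount(y_test_strat, minlength=3))
```

Without stratification the test subset can end up with uneven class counts, which matters more as data sets get smaller or less balanced.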

It is convenient to sneak into this step the variable that describes the *k* for the K-Fold cross validation, where *k* is the number of folds.

> K-Fold Cross Validation:

> The data set is divided into $k$ equal sized subsets, with one subset as the testing set and the $k-1$ remaining subsets serving as the training set.  This process is then repeated $k$ times, so that each subset serves as the testing set exactly once.  For each of the $k$ folds, a model trained on the training set predicts the testing set, and the average error is then calculated as:


> $$ CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \frac{1}{n_i} \lVert y_i - \hat{y}_i \rVert^2 $$

> where $k$ is the count of subsets, $n_i$ is the number of instances in subset $i$, $y_i$ is the vector of actual values in that subset, and $\hat{y}_i$ is the vector of values predicted when trained on the other $k-1$ subsets, all for each repetition $i$ of the total $k$ folds.  For a classifier the squared error is swapped for a classification metric; this example scores each fold with F1.

We will perform 10-fold cross-validation.

In [7]:
from sklearn.model_selection import KFold

# Define 10 fold cross-validation
cv = KFold(n_splits=10)
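
To show what the `KFold` object actually does, here is a minimal sketch that walks through the folds by hand and averages a per-fold score. It uses `GaussianNB` and plain accuracy purely for illustration; the names `fold_scores`, `fold_sizes`, and `cv_score` are assumptions of this sketch, not part of the example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=69)

cv = KFold(n_splits=10)

fold_scores = []
fold_sizes = []
# cv.split yields one (training indices, held-out indices) pair per fold
for train_idx, valid_idx in cv.split(X_train):
    model = GaussianNB().fit(X_train[train_idx], y_train[train_idx])
    # Accuracy of this fold's model on the subset it never saw
    fold_scores.append(model.score(X_train[valid_idx], y_train[valid_idx]))
    fold_sizes.append(len(valid_idx))

# The cross-validated score is the average over the k folds
cv_score = np.mean(fold_scores)
print(fold_sizes)
print(cv_score)
```

Each of the 120 training instances lands in exactly one held-out subset, so every instance is used for both fitting and scoring, just never in the same fold.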

&nbsp;

&nbsp;

## Step 3
**Build, Train, & Test the SVM Pipeline**

We will use a `pipeline` to organize our grid of possible hyperparameters.  Pipelines will make your life easier and keep your code reader friendly.  They are a lot like Samuel Adams, in that they're always a good decision.

&nbsp;

The format of the hyperparameter dictionary keys is always the same when using `make_pipeline`.

The format is **lowercase transformer/estimator object name** + two underscores + **hyperparameter** = list of desired hyperparameter values.

For example, if we wanted to use boring old `sklearn.linear_model.LinearRegression` as our estimator, we would define our parameter dictionary object as `dict(linearregression__fit_intercept=[True, False], linearregression__normalize=[True, False])`.  (The `normalize` argument was removed from `LinearRegression` in scikit-learn 1.2, so leave that key out on newer versions.)
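
If you are ever unsure what a key should be called, the pipeline can tell you. The sketch below builds a two-step pipeline and lists the step names that `make_pipeline` generated along with the tunable keys. The `StandardScaler` step is only there to show a second name; it is not part of this example.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# make_pipeline names each step after its lowercased class name
pipeline = make_pipeline(StandardScaler(), SVC())
step_names = list(pipeline.named_steps)
print(step_names)

# get_params() lists every key that can be used in a parameter grid
tunable = sorted(pipeline.get_params().keys())
print([name for name in tunable if name.startswith("svc__")])
```

Any key printed by `get_params()` can go straight into the dictionary handed to `GridSearchCV`.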

In [8]:
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


# Define the pipeline object
svc_pipeline = make_pipeline(SVC())

# Define parameter dictionary that we will pass into GridSearchCV
# Note: svc__degree only affects the 'poly' kernel; the other kernels ignore it
svc_parameters = dict(svc__kernel=['linear', 'poly', 'rbf'],
                      svc__degree=[2, 3, 4],
                      svc__shrinking=[True, False])
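
To get a feel for how much work a grid implies, the sketch below expands the same dictionary with `ParameterGrid`, the helper scikit-learn provides for enumerating combinations. The names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

svc_parameters = dict(svc__kernel=['linear', 'poly', 'rbf'],
                      svc__degree=[2, 3, 4],
                      svc__shrinking=[True, False])

# Every combination of the listed values is one candidate model
grid = ParameterGrid(svc_parameters)
n_candidates = len(grid)

# With 10-fold cross-validation, each candidate is fitted 10 times
n_fits = n_candidates * 10
print(n_candidates, n_fits)

# Peek at the first few candidates
for params in list(grid)[:3]:
    print(params)
```

Grids grow multiplicatively, so adding one more hyperparameter with a handful of values can easily multiply the run time on larger data sets.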

Next, define the `GridSearchCV` object and fit it to the training data.
The `scoring` parameter here is defined as `f1_micro`, which computes the F1 score by calculating "metrics globally by counting the total true positives, false negatives, and false positives." [Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
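
One consequence of counting globally is worth seeing in code: when every instance has exactly one label, as with the iris classes, micro-averaged F1 works out to the same number as plain accuracy. The sketch below uses made-up labels to illustrate this, with macro-averaged F1 for contrast.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for illustration: 5 of the 7 predictions are correct
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
accuracy = accuracy_score(y_true, y_pred)

# micro_f1 and accuracy agree; macro_f1 averages the per-class F1 scores
print(micro_f1, accuracy, macro_f1)
```

If the classes were heavily imbalanced, `f1_macro` would be the variant that weights each class equally.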

<span style="color:red;font-weight:bold;">Notice</span>:  We input the 10-fold cross-validation object `cv` into our `GridSearchCV` function.

In [9]:
from sklearn.model_selection import GridSearchCV

# Define the model & input the hyperparameter dictionary objects into the GridSearchCV function
svm_model = GridSearchCV(svc_pipeline, param_grid=svc_parameters, scoring='f1_micro', cv=cv)

# Fit the SVC pipeline over its grid of hyperparameters using the training set
svm_model.fit(X_train, y_train)

GridSearchCV(cv=KFold(n_splits=10, random_state=None, shuffle=False),
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'svc__kernel': ['linear', 'poly', 'rbf'], 'svc__degree': [2, 3, 4], 'svc__shrinking': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1_micro', verbose=0)

The alphabet soup of output is a sign that everything worked.

Let's find out which hyperparameters scored the highest F1.

In [10]:
svm_model.best_params_

{'svc__degree': 2, 'svc__kernel': 'linear', 'svc__shrinking': True}

&nbsp;

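Above are the hyperparameters that yielded the highest cross-validated score.  Because the winning kernel is linear, the reported `svc__degree` of 2 has no effect; `degree` is only used by the `poly` kernel.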
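
`best_params_` only reports the winner. To see how every candidate fared, a fitted `GridSearchCV` also exposes `cv_results_` and `best_score_`. The sketch below rebuilds the same search so that it runs on its own; the names `search`, `results`, and `columns` are illustrative.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=69)

search = GridSearchCV(
    make_pipeline(SVC()),
    param_grid=dict(svc__kernel=['linear', 'poly', 'rbf'],
                    svc__degree=[2, 3, 4],
                    svc__shrinking=[True, False]),
    scoring='f1_micro',
    cv=KFold(n_splits=10))
search.fit(X_train, y_train)

# cv_results_ is a dict of arrays, one entry per candidate
results = pd.DataFrame(search.cv_results_)
columns = ['param_svc__kernel', 'param_svc__degree',
           'param_svc__shrinking', 'mean_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))

# Mean cross-validated score of the winning candidate
print(search.best_score_)
```

Several candidates can tie for first place; the table makes it obvious when the "best" setting is one of many equally good ones.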

The `svm_model` object will use the best hyperparameters when we call `.predict(X_test)` to test the accuracy of our optimized model, because `GridSearchCV` refits the best pipeline on the full training set by default (`refit=True`).  The output is made reader friendly by using the `sklearn.metrics.classification_report` function.

In [11]:
from sklearn.metrics import classification_report

# Predict the values of y_test from inputting X_test
svm_y_predicted = svm_model.predict(X_test)

# Print a reader friendly classification report
svm_report = classification_report(y_test, svm_y_predicted)
print(svm_report)

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        10
          1       1.00      1.00      1.00         8
          2       1.00      1.00      1.00        12

avg / total       1.00      1.00      1.00        30



Don't expect results like this when working with real world data.  A perfect accuracy score on a real life data set is a sign that something is wrong.  Possible sources of error could be overfitting or insufficient testing sample size.

&nbsp;

&nbsp;
## Step 4
**Build, Train, and Test Naive Bayes Estimator**

The procedure here is nearly identical to that of step 3 except now we will use the `GaussianNB` estimator.  We will forgo hyperparameter optimization here, but still use `GridSearchCV` for its easy cross-validation ability.

In [12]:
from sklearn.naive_bayes import GaussianNB

# Define pipeline object
nb_pipeline = make_pipeline(GaussianNB())

# Define the model & input an empty hyperparameter dictionary
nb_model = GridSearchCV(nb_pipeline, param_grid=dict(), scoring='f1_micro', cv=cv)

# Fit the training data
nb_model.fit(X_train, y_train)

GridSearchCV(cv=KFold(n_splits=10, random_state=None, shuffle=False),
       error_score='raise',
       estimator=Pipeline(memory=None, steps=[('gaussiannb', GaussianNB(priors=None))]),
       fit_params=None, iid=True, n_jobs=1, param_grid={},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1_micro', verbose=0)
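
With an empty grid, `GridSearchCV` is doing nothing more than cross-validating a single candidate. As a hedged aside, the sketch below shows that `cross_val_score` produces the same average directly, along with the individual fold scores. The name `fold_scores` is illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import (GridSearchCV, KFold, cross_val_score,
                                     train_test_split)
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=69)

cv = KFold(n_splits=10)
nb_pipeline = make_pipeline(GaussianNB())

# One score per fold, with no grid involved
fold_scores = cross_val_score(nb_pipeline, X_train, y_train,
                              scoring='f1_micro', cv=cv)

# The empty-grid search from this step, for comparison
nb_model = GridSearchCV(nb_pipeline, param_grid=dict(),
                        scoring='f1_micro', cv=cv)
nb_model.fit(X_train, y_train)

print(fold_scores)
print(fold_scores.mean(), nb_model.best_score_)
```

The `GridSearchCV` route still has one advantage: it hands back a refitted estimator with the same `.predict` interface as the SVM model.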

In [13]:
# Predict the values using the Naive Bayes classifier
nb_y_predicted = nb_model.predict(X_test)

# Print a reader friendly classification report
nb_report = classification_report(y_test, nb_y_predicted)
print(nb_report)

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        10
          1       0.89      1.00      0.94         8
          2       1.00      0.92      0.96        12

avg / total       0.97      0.97      0.97        30



The optimized SVM classifier achieved an F1 score of 1.00 on the test set, 0.03 higher than the 0.97 of the Naive Bayes classifier.  With only 30 test instances, that gap amounts to a single misclassified flower.  On this split, then, the SVM classifier is preferred over Naive Bayes when predicting the species of each *iris* genus flower instance based on sepal width/length and petal width/length, though the margin is small.
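
A confusion matrix shows where that difference comes from by counting, for each true class, how the predictions were distributed. The sketch below is a minimal standalone version for the Naive Bayes classifier; it fits `GaussianNB` directly rather than through `GridSearchCV`, and the names `matrix` and `n_errors` are illustrative.

```python
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

flowers = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    flowers.data, flowers.target, test_size=0.2, random_state=69)

nb = GaussianNB().fit(X_train, y_train)

# Rows are the true classes, columns the predicted classes
matrix = confusion_matrix(y_test, nb.predict(X_test), labels=[0, 1, 2])
print(flowers.target_names)
print(matrix)

# Everything off the diagonal is a misclassification
n_errors = matrix.sum() - matrix.trace()
print(n_errors)
```

On a test set this small, reading the raw counts is often more informative than comparing rounded percentages.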

### Conclusion

The point of this script is to give examples of how the different machine learning modules from the sklearn package can be used together to write code that is easy to read and hopefully intuitive to write.

The code from steps 3 and 4 showcases how `GridSearchCV` can wrap an sklearn `pipeline` to choose the best hyperparameters for the specified estimator algorithm.  The best hyperparameters are chosen by scoring each combination with the micro-averaged F1 metric across the folds of the cross-validation object, which was defined as `cv`.

I understand that the code may seem dense if you are at the initial stage of the learning curve with sklearn.  I did make a conscious effort to comment my code above in hopes that it would support reader understanding.  I encourage you to grab a different data set and emulate these pipelines until you feel comfortable with this process.

If you believe, just from reading this script, that you can code better than me then I encourage you to write all your *better* code, by hand, on the back of the fast food wrapper that you had for lunch, and then immediately put it into the trash.  Everyone will be very impressed.