In [1]:
from sklearn import svm
import numpy as np
import matplotlib.pyplot as plt
np.set_printoptions(precision=4)

# runs the util notebook so that those functions are available
%run utils.ipynb

* Split the data into training, validation, and test sets.
* Fit a baseline linear SVM with scikit-learn's default settings (one-vs-rest decision function).
* Run SVMs with different hyperparameters, using GridSearch, to find the best combination.
* Start from the linear SVM baseline, then explore options for nonlinear SVMs.
* Select the best model and justify the choice.
* Report the test accuracy of the best model.
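
The hyperparameter search step in the plan above can be sketched as follows. This is a minimal illustration only: the data is a synthetic stand-in generated with `make_classification` (an assumption, not the bean data), and the grid values and the names `X_demo`, `y_demo`, `cv_demo`, and `search` are hypothetical choices rather than the settings used later.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the bean data (assumption: 3 classes, 8 features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)

# Illustrative grid only: the linear kernel varies C,
# the RBF kernel varies both C and gamma.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
]

# Fixed random_state so every candidate is scored on the same folds.
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(SVC(), param_grid, cv=cv_demo, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 4))
```

Passing a list of dictionaries keeps `gamma` out of the linear-kernel candidates, where it has no effect.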

In [17]:
X, y, features = load_standardized_beans()
X_train, X_valid, X_test, y_train, y_valid, y_test =  split(X,y)
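
The `split` helper lives in `utils.ipynb` and is not shown here. As a rough idea of what a stratified three-way split can look like, here is a sketch built on `train_test_split`; the function name `three_way_split`, its default proportions, and the toy data are all assumptions and may differ from what `split` actually does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def three_way_split(X, y, valid_size=0.2, test_size=0.2, seed=0):
    """Hypothetical stand-in for split(): stratified train/valid/test."""
    # First carve off the test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    # Then split the remainder; rescale so that valid_size is
    # expressed as a fraction of the full dataset.
    rel_valid = valid_size / (1.0 - test_size)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_rest, y_rest, test_size=rel_valid, stratify=y_rest,
        random_state=seed)
    return X_train, X_valid, X_test, y_train, y_valid, y_test

# Toy data: 100 samples, 2 balanced classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.repeat([0, 1], 50)

parts = three_way_split(X_demo, y_demo)
print([len(p) for p in parts[:3]])
```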

In [18]:
# Baseline SVM: linear kernel, all other parameters left at their defaults (C=1.0)
classifier = svm.SVC(kernel="linear")

classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_valid)
print('accuracy: ', np.mean(y_valid == y_pred))

accuracy:  0.9265515975027543


In [19]:
# Cross-validate with stratified K-folds to get a more reliable estimate of the baseline.
from sklearn.model_selection import StratifiedKFold, cross_validate, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True)
accuracy = cross_val_score(classifier, X, y, cv=cv, 
                           scoring='accuracy')
np.mean(accuracy)

0.9263829521994449
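
Because the `StratifiedKFold` above shuffles without a fixed `random_state`, repeated runs score the model on different folds, so the mean accuracy moves slightly from run to run. The sketch below shows one way to make the comparison repeatable. It uses synthetic data from `make_classification` as a stand-in, and the helper name `mean_cv_accuracy` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the bean data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)
clf = SVC(kernel="linear")

def mean_cv_accuracy(seed):
    """Mean 5-fold accuracy with the shuffle controlled by `seed`."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="accuracy")
    return scores.mean()

same_a = mean_cv_accuracy(seed=0)
same_b = mean_cv_accuracy(seed=0)   # same folds as same_a
other = mean_cv_accuracy(seed=1)    # different folds

print(same_a == same_b)
print(same_a, other)
```

Fixing the seed means that two models evaluated one after the other see identical folds, so a difference in their scores reflects the models rather than the shuffle.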

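scikit-learn's documentation lists one-vs-rest (`'ovr'`) as the default for the `decision_function_shape` argument of `SVC`. Note that this argument only changes the shape of the array returned by `decision_function`: internally, `SVC` always trains one-vs-one classifiers for multiclass problems. Switching to `'ovo'` therefore should not change the predictions, and the small difference in accuracy below most likely comes from the CV folds being reshuffled (the `StratifiedKFold` above uses `shuffle=True` with no fixed `random_state`). Moving forward we will keep the default, one-vs-rest.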

In [22]:
classifier = svm.SVC(kernel="linear", decision_function_shape='ovo')

classifier.fit(X_train, y_train)
accuracy = cross_val_score(classifier, X, y, cv=cv, 
                           scoring='accuracy')
np.mean(accuracy)

0.9263097466461845
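
To check what `decision_function_shape` actually affects, the following sketch fits the same linear `SVC` twice on a synthetic four-class dataset (an assumption, chosen so that the two shapes differ) and compares the decision-function shapes and the predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic 4-class stand-in (assumption): with 4 classes there are
# 4 one-vs-rest columns but 4 * 3 / 2 = 6 one-vs-one columns.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=6, n_redundant=0,
    n_classes=4, random_state=0,
)

ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X_demo, y_demo)
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X_demo, y_demo)

print(ovr.decision_function(X_demo).shape)  # (300, 4)
print(ovo.decision_function(X_demo).shape)  # (300, 6)

# With the default break_ties=False, both settings predict identically.
print(np.array_equal(ovr.predict(X_demo), ovo.predict(X_demo)))
```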

Before proceeding, we verify our assumption that the standardized data provides a benefit over the raw data:

In [14]:
# Trying SVM classification with the non-standardized data

X, y, features = load_beans()

X_train, X_valid, X_test, y_train, y_valid, y_test =  split(X,y)

classifier = svm.SVC(kernel="linear")

classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_valid)
print('accuracy: ', np.mean(y_valid == y_pred))
cv = StratifiedKFold(n_splits=5, shuffle=True)
accuracy = cross_val_score(classifier, X, y, cv=cv, 
                           scoring='accuracy')
np.mean(accuracy)

Yes: the standardized data provides some accuracy benefit over the raw data, and the larger improvement may be in computation time. Regardless, from here on we leave the raw data behind and work only with the standardized data.
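
One way to compare raw and standardized features on identical folds is to put the scaler inside a `Pipeline`, so it is refit on the training portion of each fold. The sketch below uses scikit-learn's bundled wine dataset as a stand-in (an assumption: its features also sit on very different scales); the names `raw_model`, `scaled_model`, and `cv_demo` are hypothetical, and this is not necessarily how `load_standardized_beans()` standardizes the data.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset whose features have very different ranges.
X_demo, y_demo = load_wine(return_X_y=True)

# Same folds for both models, so the scores are directly comparable.
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

raw_model = SVC(kernel="linear")
scaled_model = make_pipeline(StandardScaler(), SVC(kernel="linear"))

raw_scores = cross_val_score(
    raw_model, X_demo, y_demo, cv=cv_demo, scoring="accuracy")
scaled_scores = cross_val_score(
    scaled_model, X_demo, y_demo, cv=cv_demo, scoring="accuracy")

print(raw_scores.mean(), scaled_scores.mean())
```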

Now that we have established a baseline accuracy of **92.6%** using a linear SVM with scikit-learn's defaults on the standardized dataset, we will proceed to tuning the hyperparameters to see whether we can find a better model for predicting the dry bean varieties.