<h1><center>Scikit-Learn Comparison</center></h1>

Here we compare our base estimator *MyLinearSVM* against scikit-learn's equivalent, *LinearSVC*, on the spam dataset.

In [2]:
import sys

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
from sklearn import datasets
from matplotlib import pyplot as plt

sys.path.append("../libraries")
import example_utils as examp
import base_estimators as base
import multiclass_estimators as multi

plt.rcParams["figure.figsize"] = (14,8)

## Load and transform data for binary classification

In [14]:
url = "https://web.stanford.edu/~hastie/ElemStatLearn/datasets/spam.data"
cnames = ["col_"+str(x) for x in range(57)] + ["target"]
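# Raw string avoids the invalid "\s" escape warning
spam = pd.read_table(url, sep=r"\s+", names=cnames)

# Pass columns= explicitly; the positional axis argument was removed in pandas 2.0
X = spam.drop(columns="target").copy()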
y = spam["target"].copy()

# Change target to -1/+1
y[y==0] = -1
# Divide the data into training and test sets. By default, 25% goes into the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Standardize the data
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
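
The scaler above is fit on the training set only and then reused on the test set. As a rough illustration of what that means, here is a minimal NumPy sketch on synthetic data (the arrays `X_tr` and `X_te` are made up for this example and are not the spam data). It assumes the usual standardization, subtracting the training mean and dividing by the training standard deviation.

```python
import numpy as np

# Synthetic stand-ins for a train/test split (not the spam data).
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
X_te = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

# Statistics come from the training set only.
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same shift and scale are applied to both sets.
X_tr_std = (X_tr - mu) / sigma
X_te_std = (X_te - mu) / sigma

print("train mean:", X_tr_std.mean(axis=0).round(6))
print("train std: ", X_tr_std.std(axis=0).round(6))
print("test mean: ", X_te_std.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1 by construction. The test columns are only approximately standardized, which is expected, since no test-set statistics were used.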

## Test learning performance between LinearSVC and MyLinearSVM

Both models are fit with their default parameters.

In [18]:
model1 = LinearSVC()
model1.fit(X_train, y_train)

train_preds = model1.predict(X_train)
test_preds = model1.predict(X_test)

print("Binary Classification training accuracy for sklearn: {}".format(accuracy_score(y_train, train_preds)))
print("Binary Classification testing accuracy for sklearn: {}".format(accuracy_score(y_test, test_preds)))

Binary Classification training accuracy for sklearn: 0.9272463768115942
Binary Classification testing accuracy for sklearn: 0.9252823631624674


In [20]:
model2 = base.MyLinearSVM()
model2.fit(X_train, y_train)

train_preds = model2.predict(X_train)
test_preds = model2.predict(X_test)

print("Binary Classification training accuracy for MyLinearSVM: {}".format(accuracy_score(y_train, train_preds)))
print("Binary Classification testing accuracy for MyLinearSVM: {}".format(accuracy_score(y_test, test_preds)))

Binary Classification training accuracy for MyLinearSVM: 0.9063768115942029
Binary Classification testing accuracy for MyLinearSVM: 0.9122502172024327
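
To put the two test accuracies printed above in context, the sketch below computes their gap and a rough binomial standard error for each. It assumes 1151 test rows, which is what a default 25% split of a 4601-row dataset would give; that count is an assumption here, not something the notebook prints. The standard errors treat the two accuracies as independent, which they are not (both models see the same test rows), so this is only a ballpark check.

```python
import math

# Test accuracies copied from the printed outputs above.
acc_sklearn = 0.9252823631624674
acc_mine = 0.9122502172024327

# Assumption: 1151 test rows (25% of a 4601-row dataset, rounded up).
n_test = 1151


def binomial_se(acc, n):
    """Standard error of an accuracy estimated from n independent samples."""
    return math.sqrt(acc * (1.0 - acc) / n)


gap = acc_sklearn - acc_mine
se_sklearn = binomial_se(acc_sklearn, n_test)
se_mine = binomial_se(acc_mine, n_test)

print("test accuracy gap: {:.4f}".format(gap))
print("approx. standard error (LinearSVC):   {:.4f}".format(se_sklearn))
print("approx. standard error (MyLinearSVM): {:.4f}".format(se_mine))
```

Under these assumptions the gap is under two standard errors of either estimate, so a single split is weak evidence about which model generalizes better.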


scikit-learn's *LinearSVC* is more accurate here, by about 2.1 percentage points on the training set and 1.3 on the test set. Now let's compare fitting time.

In [21]:
model1 = LinearSVC()
%timeit model1.fit(X_train, y_train)

339 ms ± 4.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [22]:
model = base.MyLinearSVM()
%timeit model.fit(X_train, y_train)

72.2 ms ± 4.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


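*MyLinearSVM* fits about 4.7 times faster here (72.2 ms vs. 339 ms). There seems to be some overhead in the scikit-learn implementation, though the two models may also stop at different points in the optimization. Whether the gap shrinks on a larger dataset is not tested here.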
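
One way to check how fit time changes with dataset size would be to time each model on synthetic data of increasing size. The sketch below does this for *LinearSVC* only. The helper `best_fit_time`, the sizes, and the use of `make_classification` are choices made for this illustration, and it reports the fastest of a few runs rather than the mean that `%timeit` prints. Passing `base.MyLinearSVM` in place of `LinearSVC` should give the matching numbers for our estimator, assuming it can be constructed with no arguments as in the cells above.

```python
import time

from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def best_fit_time(make_model, X, y, repeats=3):
    """Return the fastest wall-clock fit time (seconds) over `repeats` runs."""
    times = []
    for _ in range(repeats):
        model = make_model()  # fresh, unfitted estimator for every run
        start = time.perf_counter()
        model.fit(X, y)
        times.append(time.perf_counter() - start)
    return min(times)


# Hypothetical sizes; 57 features matches the spam data's column count.
sizes = (500, 2000, 8000)
timings = {}
for n in sizes:
    X_syn, y_syn = make_classification(n_samples=n, n_features=57, random_state=42)
    y_syn = 2 * y_syn - 1  # map labels 0/1 to -1/+1, as done for the spam target
    X_syn = StandardScaler().fit_transform(X_syn)
    timings[n] = best_fit_time(LinearSVC, X_syn, y_syn)
    print("n = {:>5}: {:.1f} ms".format(n, 1000 * timings[n]))
```

Synthetic data will not behave exactly like the spam data, so the absolute times matter less than how the ratio between the two models changes as `n` grows.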