This notebook demonstrates how to use the HuberSVM implementation within the scikit-learn framework and how to reproduce the benchmark results shown in [TODO: link paper].

In [1]:
from huber_svm import HuberSVC

import pandas
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import ElasticNet, Lasso, RidgeClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold, cross_val_score
from sklearn.preprocessing import scale

Load the Iris dataset for the benchmark, standardize the features, and shuffle the samples.

In [2]:
data = load_iris()

X = scale(data['data'])
y = data['target']

idx = np.random.permutation(X.shape[0])
X = X[idx]
y = y[idx]
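
The permutation above is unseeded, so the fold assignments, and with them the reported scores, can change from run to run. A minimal sketch of a reproducible shuffle follows; the seed value `0` and the names `rng`, `X_shuffled`, and `y_shuffled` are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale

data = load_iris()
X = scale(data["data"])  # zero mean, unit variance per feature
y = data["target"]

# Seeded generator: the seed 0 is an arbitrary, hypothetical choice.
rng = np.random.default_rng(0)
idx = rng.permutation(X.shape[0])

# Apply the same permutation to features and labels so they stay aligned.
X_shuffled, y_shuffled = X[idx], y[idx]
```

Re-running this cell yields the same ordering every time, which makes individual fold scores comparable across sessions.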

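We compare the HuberSVM with $\ell_1$-, $\ell_2$-, and elastic-net-regularized linear models (scikit-learn's Lasso, RidgeClassifier, and ElasticNet). The regularization parameters of each model are tuned by nested cross-validation: an outer loop of five stratified folds estimates accuracy, and within each outer training split a three-fold grid search selects the parameters.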

In [3]:
lasso = OneVsRestClassifier(Lasso())
param_lasso = {'estimator__alpha': [100, 10, 1, 0.1, 1e-2, 1e-3]}

elastic = OneVsRestClassifier(ElasticNet())
param_elastic = {'estimator__alpha': [100, 10, 1, 0.1, 1e-2, 1e-3], 
                 'estimator__l1_ratio': np.linspace(0.01, 0.99, 5)}

ridge = RidgeClassifier(solver='lsqr')
param_ridge = {'alpha': [100, 10, 1, 0.1, 1e-2, 1e-3]}

huber = OneVsRestClassifier(HuberSVC())
param_huber = {'estimator__C': [100, 10, 1, 0.1, 1e-2, 1e-3], 
              'estimator__lambd': [100, 10, 1, 0.1, 1e-2, 1e-3], 
              'estimator__mu': [100, 10, 1, 0.1, 1e-2, 1e-3]}

n_folds = 5
param_folds = 3
scoring = 'accuracy'
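
To get a feel for the cost of these grids, the sketch below counts the candidate parameter combinations and the resulting number of cross-validated fits per model. It assumes a scikit-learn version that provides `sklearn.model_selection.ParameterGrid`; the names `grids` and `fit_counts` are hypothetical, and the counts exclude the final refit per outer fold and the per-class fits inside `OneVsRestClassifier`.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

values = [100, 10, 1, 0.1, 1e-2, 1e-3]
grids = {
    "Lasso": {"estimator__alpha": values},
    "ElasticNet": {
        "estimator__alpha": values,
        "estimator__l1_ratio": np.linspace(0.01, 0.99, 5).tolist(),
    },
    "Ridge": {"alpha": values},
    "HuberSVC": {
        "estimator__C": values,
        "estimator__lambd": values,
        "estimator__mu": values,
    },
}

n_folds, param_folds = 5, 3  # outer folds, inner folds

fit_counts = {}
for name, grid in grids.items():
    n_candidates = len(ParameterGrid(grid))  # size of the Cartesian product
    fit_counts[name] = n_candidates * param_folds * n_folds
    print(f"{name}: {n_candidates} candidates, {fit_counts[name]} fits")
```

The HuberSVC grid has three parameters with six values each, so it dominates the runtime of the benchmark.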

Main benchmark loop over the outer folds and the four classifiers.

In [4]:
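result_df = pandas.DataFrame()
# Outer loop: stratified folds used only to estimate test accuracy.
for i, (train_index, test_index) in enumerate(StratifiedKFold(y, n_folds=n_folds)):
    for clf_name, clf, param_grid in [('Lasso', lasso, param_lasso), 
                                      ('ElasticNet', elastic, param_elastic), 
                                      ('Ridge', ridge, param_ridge), 
                                      ('HuberSVC', huber, param_huber)]:

        # Inner loop: grid search with param_folds-fold CV on the outer training split only.
        gs = GridSearchCV(clf, param_grid, scoring=scoring, cv=param_folds, n_jobs=-1)
        gs.fit(X[train_index], y[train_index])
        # Best parameter setting, refit on the whole outer training split.
        best_clf = gs.best_estimator_

        # Score the tuned model on the held-out outer test fold.
        score = accuracy_score(y[test_index], best_clf.predict(X[test_index]))
        result_df.loc[i, clf_name] = score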
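
The loop above relies on the `sklearn.grid_search` and `sklearn.cross_validation` modules, which were removed in scikit-learn 0.20 in favor of `sklearn.model_selection`. As a rough sketch of the same nested cross-validation with the newer API, the example below uses `RidgeClassifier` as a stand-in because `HuberSVC` may not be installed; the shuffling and `random_state=0` in the outer splitter are assumptions and differ from the unshuffled folds used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.preprocessing import scale

data = load_iris()
X, y = scale(data["data"]), data["target"]

# Outer loop: five stratified folds for the accuracy estimate.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop: three-fold grid search, run separately on each outer training split.
inner_search = GridSearchCV(
    RidgeClassifier(solver="lsqr"),
    {"alpha": [100, 10, 1, 0.1, 1e-2, 1e-3]},
    scoring="accuracy",
    cv=3,
)

# cross_val_score clones and refits the whole search object on every outer fold.
scores = cross_val_score(inner_search, X, y, scoring="accuracy", cv=outer_cv)
print(scores, scores.mean())
```

Passing the search object to `cross_val_score` keeps the outer test fold out of parameter selection, which is the same separation the explicit loop enforces by hand.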

The results show that the HuberSVM matches or outperforms the other regularized models in every fold (it ties with Ridge in fold 1) and achieves the highest mean accuracy, 0.947 compared with 0.827 for Ridge, the next best.

In [5]:
result_df.loc['Mean'] = result_df.mean()
pandas.options.display.float_format = '{:,.3f}'.format
result_df

Fold,Lasso,ElasticNet,Ridge,HuberSVC
0,0.8,0.767,0.8,1.0
1,0.867,0.867,0.967,0.967
2,0.833,0.833,0.833,0.967
3,0.7,0.733,0.733,0.9
4,0.833,0.8,0.8,0.9
Mean,0.807,0.8,0.827,0.947
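
As a quick consistency check, the sketch below re-enters the per-fold accuracies from the table, recomputes the column means, and counts the folds in which HuberSVC strictly beats or ties the best baseline. The variable names (`scores`, `best_baseline`, `strict_wins`, `ties`) are hypothetical; the numbers are copied from the table above.

```python
import pandas as pd

# Per-fold accuracies as reported in the results table.
scores = pd.DataFrame(
    {
        "Lasso": [0.8, 0.867, 0.833, 0.7, 0.833],
        "ElasticNet": [0.767, 0.867, 0.833, 0.733, 0.8],
        "Ridge": [0.8, 0.967, 0.833, 0.733, 0.8],
        "HuberSVC": [1.0, 0.967, 0.967, 0.9, 0.9],
    }
)

means = scores.mean()

# Best competing score in each fold, excluding HuberSVC itself.
best_baseline = scores.drop(columns="HuberSVC").max(axis=1)
strict_wins = int((scores["HuberSVC"] > best_baseline).sum())
ties = int((scores["HuberSVC"] == best_baseline).sum())

print(means.round(3))
print(f"strict wins: {strict_wins}, ties: {ties}")
```

With only five folds of 30 samples each, differences of a single fold are coarse, so the per-fold comparison is best read alongside the mean.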
