Q1. Recall that $N$ is the size of the data set and $d$ is the dimensionality of the input space. The original formulation of the hard-margin SVM problem (minimize $\frac{1}{2}\mathbf{w}^T\mathbf{w}$ subject to the inequality constraints), without going through the Lagrangian dual problem, is

A. The primal problem has $d+1$ variables (the $d$ weights plus the bias term), whereas the dual problem has $N$ variables (one $\alpha_n$ per data point). The benefit of the dual is that its number of variables does not depend on the dimensionality of the input space, which can be extremely large after a non-linear transform.

__ A1. [d] a quadratic programming problem with $d+1$ variables. __
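
For reference, the primal problem from Q1 can be written out in full. This uses the standard hard-margin formulation, with the bias $b$ and the weight vector $\mathbf{w} \in \mathbb{R}^d$ as notation assumed here:

\begin{align}
\min_{b, \mathbf{w}} \quad &\frac{1}{2} \mathbf{w}^T \mathbf{w} \\
\text{s.t.} \quad & y_n (\mathbf{w}^T \mathbf{x}_n + b) \geq 1 \quad n = 1, \ldots, N
\end{align}

The optimization variables are the $d$ components of $\mathbf{w}$ and the bias $b$, which gives $d+1$ variables and $N$ inequality constraints.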

In [1]:
import pandas as pd
import numpy as np

# load data into dataframes
train = pd.read_csv("http://www.amlbook.com/data/zip/features.train", header = None, sep = r"\s+", dtype = float)
test = pd.read_csv("http://www.amlbook.com/data/zip/features.test", header = None, sep = r"\s+", dtype = float)

# split data into x and y (first column is y, rest are features of digit intensity and symmetry)
train_x = np.array(train[train.columns[1:]])
train_y = np.array(train[train.columns[0]])

test_x = np.array(test[test.columns[1:]])
test_y = np.array(test[test.columns[0]])

In the rest of the problems of this homework set, we apply soft-margin SVM to
handwritten digits from the processed US Postal Service Zip Code data set. Download
the data (extracted features of intensity and symmetry) for training and testing:

http://www.amlbook.com/data/zip/features.train
http://www.amlbook.com/data/zip/features.test


(the format of each row is: __digit intensity symmetry__). We will train two types
of binary classifiers: one-versus-one (one digit is class $+1$ and another digit is class
$-1$, with the rest of the digits disregarded), and one-versus-all (one digit is class $+1$
and the rest of the digits are class $-1$).
The data set has thousands of points, and some quadratic programming packages
cannot handle this size. We recommend that you use the packages in libsvm:

http://www.csie.ntu.edu.tw/~cjlin/libsvm/
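
The two labelling schemes described above can be sketched in plain Python. The helper names `one_vs_all` and `one_vs_one` and the sample rows are made up for illustration; each row follows the stated format (digit, intensity, symmetry).

```python
def one_vs_all(rows, digit):
    """Keep every row; label +1 if the row is `digit`, otherwise -1."""
    return [(x, 1 if d == digit else -1) for d, *x in rows]


def one_vs_one(rows, pos, neg):
    """Keep only rows for `pos` (+1) and `neg` (-1); disregard the rest."""
    return [(x, 1 if d == pos else -1) for d, *x in rows if d in (pos, neg)]


# Made-up rows in the format: digit, intensity, symmetry
rows = [(1.0, 0.1, -4.0), (5.0, 0.3, -1.0), (7.0, 0.2, -2.5)]

print(one_vs_all(rows, 1))     # all three rows are kept
print(one_vs_one(rows, 1, 5))  # the row for digit 7 is dropped
```

One-versus-all keeps the full training set for every classifier, while one-versus-one trains on a much smaller subset.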

Implement SVM with soft margin on the above zip-code data set by solving

\begin{align}
\min_{\alpha} \quad  &\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m K(\mathbf{x}_n, \mathbf{x}_m) - \sum_{n = 1}^{N} \alpha_n \\
\text{s.t.} \quad  &\sum_{n=1}^{N} y_n \alpha_n = 0  \\
& 0 \leq \alpha_n \leq C \quad n = 1, \ldots, N 
\end{align}
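
To make the objective and constraints concrete, here is a minimal sketch that evaluates them for a candidate $\alpha$. It is not a solver: the function names `dual_objective` and `is_feasible`, the tolerance, and the two-point toy data are assumptions made for illustration.

```python
def dual_objective(alpha, y, K):
    """0.5 * sum_n sum_m alpha_n alpha_m y_n y_m K[n][m] - sum_n alpha_n."""
    N = len(alpha)
    quad = sum(alpha[n] * alpha[m] * y[n] * y[m] * K[n][m]
               for n in range(N) for m in range(N))
    return 0.5 * quad - sum(alpha)


def is_feasible(alpha, y, C, tol=1e-9):
    """Check sum_n y_n alpha_n = 0 and 0 <= alpha_n <= C (up to tol)."""
    balanced = abs(sum(a * t for a, t in zip(alpha, y))) <= tol
    boxed = all(-tol <= a <= C + tol for a in alpha)
    return balanced and boxed


# Toy data (made up): x = +1 with y = +1 and x = -1 with y = -1, linear kernel
y = [1, -1]
K = [[1.0, -1.0], [-1.0, 1.0]]
alpha = [0.5, 0.5]

print(dual_objective(alpha, y, K), is_feasible(alpha, y, C=1.0))
```

On this toy problem the equality constraint forces $\alpha_1 = \alpha_2 = a$, so the objective reduces to $2a^2 - 2a$, which is minimized at $a = 0.5$.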

Consider the polynomial kernel $K(\mathbf{x}_n, \mathbf{x}_m) = (1 + \mathbf{x}_n^T \mathbf{x}_m)^Q$, where $Q$ is the highest degree of the polynomial. 
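
As a sanity check on this kernel, the sketch below compares $(1 + \mathbf{x}^T\mathbf{z})^2$ with an explicit inner product in a six-dimensional feature space for two-dimensional inputs. The helper names `poly_kernel` and `phi2` and the sample points are made up for illustration.

```python
import math


def poly_kernel(x, z, Q):
    """K(x, z) = (1 + x.z)^Q."""
    return (1 + sum(a * b for a, b in zip(x, z))) ** Q


def phi2(x):
    """Explicit feature map whose inner product equals the Q = 2 kernel for 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]


# Made-up points in the 2-D (intensity, symmetry) feature space
x, z = [0.5, -1.0], [2.0, 0.25]

lhs = poly_kernel(x, z, 2)
rhs = sum(a * b for a, b in zip(phi2(x), phi2(z)))
print(lhs, rhs)
```

In scikit-learn, `kernel='poly'` computes `(gamma * <x, z> + coef0) ** degree`, so setting `gamma = 1` and `coef0 = 1` reproduces the kernel defined above.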

Q2. With $C = 0.01$ and $Q = 2$, which of the following classifiers has the __highest__ $E_\text{in}$?

In [2]:
# using sklearn.svm for its implementation of libsvm since cvxopt can't handle data size
from sklearn import svm

# We train a one-versus-all classifier for each digit
# (i.e., the 0-classifier is trained to determine whether a digit is 0 or not),
# so we relabel the ys for each digit as +1 (that digit) or -1 (any other digit) instead of the integer class.

# one classifier per digit: soft margin C = 0.01, polynomial kernel of degree 2 with gamma = 1 and coef0 = 1
classifiers = [svm.SVC(C = 0.01, kernel = 'poly', degree = 2, gamma = 1.0, coef0=1) for k in range(10)]

# arrays of in sample and out of sample errors
eins = np.empty(10)
eouts = np.empty(10)

# for every digit
for k in range(10):
    # get a new binary classified array for every digit for both training and testing data
    train_digit_y = np.where(train_y == k, 1, -1)
    test_digit_y = np.where(test_y == k, 1, -1)
    
    # train classifier on this digit
    classifiers[k].fit(train_x, train_digit_y)
    
    # test this classifiers error with training data for in sample, and test data for out of sample
    eins[k] = 1 - classifiers[k].score(train_x, train_digit_y)
    eouts[k] = 1 - classifiers[k].score(test_x, test_digit_y)

In [3]:
print("Highest in sample error is {} versus all.".format(np.argmax(eins)))

Highest in sample error is 0 versus all.


__ A2. [a] 0 versus all __


Q3. With $C = 0.01$ and $Q = 2$, which of the following classifiers has the __lowest__ $E_\text{in}$?

In [4]:
print("Lowest in sample error is {} versus all.".format(np.argmin(eins)))

Lowest in sample error is 1 versus all.


__ A3. [a] 1 versus all __

Q4. Comparing the two selected classifiers from Problems 2 and 3, which of the following values is the closest to the difference between the number of support vectors of these two classifiers?

In [5]:
# comparing # of support vectors between 1vAll and 0vAll classifiers
np.abs(np.sum(classifiers[0].n_support_) - np.sum(classifiers[1].n_support_))

1793

__ A4. [c] 1800 __

Q5. Consider the 1 versus 5 classifier with $Q = 2$ and $C \in \{0.001, 0.01, 0.1, 1\}$. Which of the following statements is correct? Going up or down means strictly so.

In [6]:
def fiveOneClassify(Q):
    '''create series of five one classifiers using all given Cs and parameter Q '''
    # soft margin values to test
    C = [0.0001, 0.001, 0.01, 0.1, 1]

    # get data only when digit = 1 or digit = 5
    train_mask = np.where((train_y == 1) | (train_y == 5))
    test_mask = np.where((test_y == 1) | (test_y == 5))

    # get only features for digits 1 and 5, map 1 to +1, 5 to -1
    train_five_one_x = train_x[train_mask]
    train_five_one_y = np.where(train_y[train_mask] == 1, 1, -1)

    test_five_one_x = test_x[test_mask]
    test_five_one_y = np.where(test_y[test_mask] == 1, 1, -1)


    # dictionary for holding results
    results_dict = {}

    for i,c in enumerate(C):
        # initialize classifier
        five_one = svm.SVC(C = c, kernel = 'poly', degree = Q, gamma = 1.0, coef0=1)

        # train classifier
        five_one.fit(train_five_one_x, train_five_one_y)

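        # number of support vectors for the first class only (label -1, i.e. digit 5);
        # n_support_ holds one count per class, so use five_one.n_support_.sum() for the total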
        supports = five_one.n_support_[0]

        # score training data for ein
        ein = 1 - five_one.score(train_five_one_x, train_five_one_y)

        # score testing data for eout
        eout = 1 - five_one.score(test_five_one_x, test_five_one_y)

        results_dict[c] = [supports, ein, eout]

    # put result in pandas dataframe and print table 
    results = pd.DataFrame.from_dict(results_dict, orient = 'index')
    results.columns = ["Support Vectors", "In Sample Error", "Out of Sample Error"]
    results.sort_index(inplace=True)
    return results

fiveOneClassify(2)

| C | Support Vectors | In Sample Error | Out of Sample Error |
|---|---|---|---|
| 0.0001 | 118 | 0.008969 | 0.016509 |
| 0.001 | 38 | 0.004484 | 0.016509 |
| 0.01 | 17 | 0.004484 | 0.018868 |
| 0.1 | 12 | 0.004484 | 0.018868 |
| 1.0 | 12 | 0.003203 | 0.018868 |


__ A5. [d] Maximum $C$ achieves the lowest $E_{\text{in}}$ __

Q6. In the 1 versus 5 classifier, comparing $Q = 2$ with $Q = 5$, which of the following
statements is correct?

In [7]:
fiveOneClassify(5)

| C | Support Vectors | In Sample Error | Out of Sample Error |
|---|---|---|---|
| 0.0001 | 13 | 0.004484 | 0.018868 |
| 0.001 | 13 | 0.004484 | 0.021226 |
| 0.01 | 12 | 0.003844 | 0.021226 |
| 0.1 | 14 | 0.003203 | 0.018868 |
| 1.0 | 11 | 0.003203 | 0.021226 |


__ A6. [b] When $C = 0.001$, the number of support vectors is lower at $Q = 5$. __

In the next two problems, we will experiment with 10-fold cross validation for the polynomial kernel. Because $E_{\text{CV}}$ is a random variable that depends on the random partition of the data, we will try $100$ runs with different partitions and base our answer on how many runs lead to a particular choice.

Q7. Consider the $1$ versus $5$ classifier with $Q = 2$. We use $E_{\text{CV}}$ to select $C \in \{0.0001, 0.001, 0.01, 0.1, 1 \}$. If there is a tie in $E_{\text{CV}}$, select the smaller $C$. Within the $100$ random runs, which of the following statements is correct?
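
The selection procedure described above can be sketched generically: for each run, draw one random 10-fold partition, average the validation error over the folds for every $C$, and pick the $C$ with the lowest average (the smaller $C$ on a tie). This is only a sketch under stated assumptions: `val_error` is a hypothetical callback that trains on the given training indices and returns the error rate on the validation indices, and $E_{\text{CV}}$ is taken as the plain mean of the fold errors.

```python
import random
from collections import Counter


def kfold_indices(n, k, rng):
    """Shuffle range(n) and split it into k folds of nearly equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_c(n, c_values, val_error, k=10, rng=None):
    """Return (best_c, best_ecv) for one random partition of n points."""
    rng = rng or random.Random()
    folds = kfold_indices(n, k, rng)
    best_c, best_ecv = None, float("inf")
    for c in sorted(c_values):
        errs = []
        for i, val_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            errs.append(val_error(train_idx, val_idx, c))
        ecv = sum(errs) / k
        if ecv < best_ecv:  # strict '<' keeps the smaller C on ties
            best_c, best_ecv = c, ecv
    return best_c, best_ecv


def tally(n, c_values, val_error, runs=100, seed=0):
    """Count how often each C wins over `runs` independent partitions."""
    rng = random.Random(seed)
    return Counter(select_c(n, c_values, val_error, rng=rng)[0] for _ in range(runs))
```

With this structure each run contributes exactly one vote, so the counts returned by `tally` sum to the number of runs.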

In [25]:
''' test of manually doing cross validation for completion purposes
    in practice, would at least be using the parallelism of scikit's cross-validation
    (shown below after Problems 7 and 8)
'''
# get data only when digit = 1 or digit = 5
train_mask = np.where((train_y == 1) | (train_y == 5))
test_mask = np.where((test_y == 1) | (test_y == 5))

# get only features for digits 1 and 5, map 1 to +1, 5 to -1
train_five_one_x = train_x[train_mask]
train_five_one_y = np.where(train_y[train_mask] == 1, 1, -1)

def test(end = 1):    
    # C values to cross-validate
    C = [0.0001, 0.001, 0.01, 0.1, 1.0]
    
    # dict to store times that a given C yields best cross validation error
    c_cv = dict(zip(C, [0]*len(C)))
    
    # store best cross validation errors for each run
    ecvs = np.empty(end)
    
    for j in range(end):
        # splitting the data into 10 random partitions 
        indices = np.arange(train_five_one_x.shape[0])

        # shuffle indices
        np.random.shuffle(indices)

        # split into 10 partitions of nearly equal size
        partitions_x = np.array_split(train_five_one_x[indices], 10)
        partitions_y = np.array_split(train_five_one_y[indices], 10)
        # using every possible partition as a validation set
        
        for k in range(len(partitions_x)):
            validation_x = partitions_x[k]
            validation_y = partitions_y[k]

            # get all data except validation set, stack back into numpy array
            cross_train_x = np.vstack(partitions_x[:k] + partitions_x[k+1:])
            cross_train_y = np.hstack(partitions_y[:k] + partitions_y[k+1:])


            # keeping track of the min validation error for this fold
            min_error = float('inf')
            min_c = 0

            for c in C:
                # initialize model
                five_one = svm.SVC(C = c, kernel = 'poly', degree = 2, gamma = 1.0, coef0=1)

                # train
                five_one.fit(cross_train_x, cross_train_y)

                # get validation error
                val_error = 1 - five_one.score(validation_x, validation_y)

                # if this is smallest error seen so far, update mins
                if val_error < min_error:
                    min_error = val_error
                    min_c = c

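            # tally the best c for this fold; tallies are per fold, so they sum to 10 * end
            c_cv[min_c] += 1
            # overwritten on every fold: after the inner loop, ecvs[j] holds the last fold's minimum error
            ecvs[j] = min_error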


    return (c_cv, ecvs)
c_cv, ecvs = test(100)
print(c_cv)
print(np.average(ecvs))

{0.1: 23, 0.0001: 361, 0.01: 47, 0.001: 524, 1.0: 45}
0.0049358974359


__ A7. [b] C = 0.001 is selected most often __

Q8. Again, consider the 1 versus 5 classifier with $Q = 2$. For the winning selection in the previous problem, the average value of $E_{\text{CV}}$ over the $100$ runs is closest to

__ A8. [c] 0.005 __

Consider the radial basis function (RBF) kernel $K(\mathbf{x}_n, \mathbf{x}_m) = \exp(-||\mathbf{x}_n - \mathbf{x}_m ||^2)$ in the soft-margin SVM approach. Focus on the 1 versus 5 classifier. 
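
The RBF kernel above is easy to evaluate directly. The sketch below uses a made-up helper `rbf` with a `gamma` parameter, so that `gamma = 1.0` corresponds to the kernel as written here.

```python
import math


def rbf(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Made-up points: the kernel is 1 for identical points and decays with distance
print(rbf([0.2, -3.0], [0.2, -3.0]))
print(rbf([0.0, 0.0], [1.0, 0.0]))
print(rbf([0.0, 0.0], [3.0, 4.0]))
```

scikit-learn's `kernel='rbf'` computes `exp(-gamma * ||x - z||^2)`, so `gamma = 1.0` matches this definition.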

Q9. Which of the following values of C results in the lowest $E_{\text{in}}$? 

In [34]:
# get data only when digit = 1 or digit = 5
train_mask = np.where((train_y == 1) | (train_y == 5))
test_mask = np.where((test_y == 1) | (test_y == 5))

# get only features for digits 1 and 5, map 1 to +1, 5 to -1
train_five_one_x = train_x[train_mask]
train_five_one_y = np.where(train_y[train_mask] == 1, 1, -1)

# do the same for testing data
test_five_one_x = test_x[test_mask]
test_five_one_y = np.where(test_y[test_mask] == 1, 1, -1)


# new C values to test for RBF kernel
C2 = [0.01, 1, 100, 10000, 1000000]

rbf_eins = np.empty(len(C2))
rbf_eouts = np.empty(len(C2))

for i,c in enumerate(C2):
    five_one_rbf = svm.SVC(C = c, kernel = 'rbf', gamma = 1.0)  # coef0 is not used by the RBF kernel
    
    five_one_rbf.fit(train_five_one_x, train_five_one_y)
    
    rbf_eins[i] = 1 - five_one_rbf.score(train_five_one_x, train_five_one_y)
    
    rbf_eouts[i] = 1 - five_one_rbf.score(test_five_one_x, test_five_one_y)

print("Lowest Ein from C = {}.".format(C2[np.argmin(rbf_eins)]))
print("Lowest Eout from C = {}.".format(C2[np.argmin(rbf_eouts)]))



Lowest Ein from C = 1000000.
Lowest Eout from C = 100.


__ A9. [e] $C = 10^6$ __

Q10. Which of the following values of $C$ results in the lowest $E_{\text{out}}$?

__ A10. [c] C = 100 __

In [18]:
from sklearn.model_selection import RepeatedKFold, cross_val_score

# get data only when digit = 1 or digit = 5
train_mask = np.where((train_y == 1) | (train_y == 5))
test_mask = np.where((test_y == 1) | (test_y == 5))

# get only features for digits 1 and 5, map 1 to +1, 5 to -1
train_five_one_x = train_x[train_mask]
train_five_one_y = np.where(train_y[train_mask] == 1, 1, -1)


def parallelCrossValidation():
    ''' more efficient version of the above making use of multiple cores while cross validating'''
    scores = []
    C = [0.0001, 0.001, 0.01, 0.1, 1.0]
    for c in C:
        # kfold cross validation generator
        kf = RepeatedKFold(n_splits = 10, n_repeats=100)

        five_one = svm.SVC(C = c, kernel = 'poly', degree = 2, gamma = 1.0, coef0=1.0)

        scores.append(cross_val_score(five_one, train_five_one_x, train_five_one_y, cv = kf, n_jobs = -1))
        

    return scores

[array([ 0.98726115,  1.        ,  1.        ,  0.98076923,  0.99358974,
         0.98717949,  0.98076923,  0.98717949,  0.99358974,  0.99358974,
         0.99363057,  1.        ,  0.98717949,  0.98717949,  0.98717949,
         1.        ,  0.99358974,  0.98717949,  0.99358974,  0.98076923,
         1.        ,  0.99358974,  0.98717949,  1.        ,  0.97435897,
         0.99358974,  0.99358974,  0.98717949,  0.99358974,  0.98076923,
         0.98726115,  0.99358974,  0.98717949,  0.98717949,  0.99358974,
         0.97435897,  1.        ,  1.        ,  0.99358974,  0.98717949,
         0.98089172,  0.99358974,  1.        ,  1.        ,  0.96794872,
         0.98717949,  0.98717949,  1.        ,  0.99358974,  0.98717949,
         1.        ,  0.99358974,  0.97435897,  0.99358974,  0.98717949,
         0.98076923,  0.99358974,  0.98076923,  0.99358974,  1.        ,
         0.99363057,  0.99358974,  0.99358974,  0.99358974,  0.99358974,
         1.        ,  0.98717949,  0.96794872,  0.9

In [51]:
svm.SVC(C = c, kernel = 'rbf', gamma = 1.0, coef0 = 1.0 )

SVC(C=1.0, cache_size=200, class_weight=None, coef0=1.0,
  decision_function_shape=None, degree=3, gamma=1.0, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)