<center>
<h2>Deep Learning using Linear Support Vector Machines</h2>
<p>by <i>Yichuan Tang</i></p>
<a href="http://arxiv.org/abs/1306.0239">http://arxiv.org/abs/1306.0239</a>
</center>
<p><b>Abstract: </b>Recently, fully-connected and convolutional neural networks have been trained to achieve state-of-the-art performance on a wide variety of tasks such as speech recognition, image classification, natural language processing, and bioinformatics. For classification tasks, most of these "deep learning" models employ the softmax activation function for prediction and minimize cross-entropy loss. In this paper, we demonstrate a small but consistent advantage of replacing the softmax layer with a linear support vector machine. Learning minimizes a margin-based loss instead of the cross-entropy loss. While there have been various combinations of neural nets and SVMs in prior art, our results using L2-SVMs show that by simply replacing softmax with linear SVMs gives significant gains on popular deep learning datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop's face expression recognition challenge.</p>

## Libraries

In [None]:
import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K
# Import everything from tensorflow.keras so that layers, models, and optimizers
# all come from the same Keras implementation.
from tensorflow.keras.datasets import mnist, cifar10
from tensorflow.keras.layers import (Dense, Dropout, Input, Conv2D, MaxPooling2D,
                                     Flatten, BatchNormalization, GaussianNoise)
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.optimizers.schedules import ExponentialDecay
from tensorflow.keras.regularizers import l2
from tensorflow.keras.preprocessing.image import ImageDataGenerator

## SVM Loss Function

The following loss function is based on the formula presented in [this course](https://cs231n.github.io/linear-classify/#multiclass-support-vector-machine-loss).

<center>
<img src="https://drive.google.com/uc?export=view&id=1ioDXTGPsD9vzbcVkBwQBOqa2CTw9gxa8">
</center>

In [None]:
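def svm_loss(layer, reg_weight=1, loss_weight=1):
    """Build a squared multiclass hinge loss with an L2 penalty on `layer`'s kernel."""
    weights = layer.weights[0]  # kernel of the final (SVM) layer

    def squared_categorical_hinge_loss(y_true, y_pred):
        # Score assigned to the true class (y_true is one-hot).
        pos = K.sum(y_true * y_pred, axis=-1)
        # Highest score among the other classes; the true-class slot is masked to 0.
        neg = K.max((1.0 - y_true) * y_pred, axis=-1)
        # Squared hinge (L2-SVM), averaged over the batch.
        hinge_loss = K.mean(K.square(K.maximum(0.0, 1.0 - pos + neg)), axis=-1)
        # Read the variable inside the loss so the penalty follows the current
        # weights; converting it to a tensor up front can freeze it to a constant.
        regularization_loss = tf.reduce_sum(tf.square(weights))
        return reg_weight*regularization_loss + loss_weight*hinge_loss

    return squared_categorical_hinge_loss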
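
To make the hinge term concrete, the following is a minimal NumPy sketch of the same per-sample computation on a toy batch. The function name `squared_hinge_per_sample` and the scores are illustrative assumptions, not part of the notebook, and the regularization term is left out.

```python
import numpy as np

def squared_hinge_per_sample(y_true, y_pred):
    """Squared multiclass hinge term for each row of a one-hot batch."""
    pos = np.sum(y_true * y_pred, axis=-1)           # score of the true class
    # The true-class slot is zeroed by the mask, so `neg` is never below 0.
    neg = np.max((1.0 - y_true) * y_pred, axis=-1)   # best score among wrong classes
    return np.square(np.maximum(0.0, 1.0 - pos + neg))

# Toy batch: two samples, three classes (illustrative values only).
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_pred = np.array([[2.0, -0.8, -0.7],    # true class clears the margin of 1
                   [0.2,  0.1, -0.5]])   # a wrong class scores higher

losses = squared_hinge_per_sample(y_true, y_pred)
print(losses)
```

The first sample contributes no loss because its true-class score exceeds every other score by at least the margin; the second is penalized quadratically for the violation.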

# MNIST

## Dataset

In [None]:
# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()


In [None]:
num_classes = 10

# convert class vectors to binary class matrices
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)

In [None]:
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255.
x_test /= 255.
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

60000 train samples
10000 test samples


## PCA

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

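# Standardize each pixel to zero mean and unit variance, using statistics
# estimated on the training set only.
scaler = StandardScaler()
scaled_x_train = scaler.fit_transform(x_train)
scaled_x_test = scaler.transform(x_test)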

In [None]:
pca = PCA(n_components = 70)
pca.fit(scaled_x_train)
x_train = pca.transform(scaled_x_train)
print('x_train pca shape:', x_train.shape)
x_test = pca.transform(scaled_x_test)
print('x_test pca shape:', x_test.shape)

x_train pca shape: (60000, 70)
x_test pca shape: (10000, 70)
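
As a rough illustration of what the projection does, here is a small NumPy sketch that reduces centered data with an SVD. It runs on random stand-in data rather than MNIST, and the variable names are assumptions made for this example; it is not meant to reproduce scikit-learn's implementation in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))   # stand-in data: 200 samples, 20 features

# Center the data, then take the top-k right singular vectors as components.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 5
components = Vt[:k]                       # shape (k, n_features)
X_reduced = X_centered @ components.T     # shape (n_samples, k)

# Fraction of the total variance retained by the first k components.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(X_reduced.shape, round(float(explained), 3))
```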


## SVM Model

The original paper suggested a fully connected model with the following specification:


*   Two hidden layers of 512 units each
*   300 minibatches of 200 samples each
*   Stochastic gradient descent with momentum
*   Learning rate linearly decayed from 0.1 to 0.0
*   Strong Gaussian noise (standard deviation 1.0) added to the input

But the problems of the paper are:


*   It didn't specify the momentum value.
*   Its batch size of 200 is not a power of two; 256 is used here instead.
*   And most importantly, it didn't specify the penalty value of the hinge loss.
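
The linear decay from 0.1 to 0.0 listed in the specification can be sketched in plain Python as below. The helper `linear_decay` is hypothetical, and updating the rate once per epoch over 400 epochs is an assumption made for this example.

```python
def linear_decay(step, total_steps, initial_lr=0.1, final_lr=0.0):
    """Learning rate at `step`, interpolated linearly between the two endpoints."""
    fraction = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return initial_lr + fraction * (final_lr - initial_lr)

total_steps = 400  # assumption: one rate update per epoch, 400 epochs
schedule = [linear_decay(s, total_steps) for s in (0, 100, 200, 400)]
print(schedule)
```

A function of this shape could be wrapped in a Keras learning-rate callback, although the cells below use a fixed small rate with the optimizer's built-in decay instead.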





In [None]:
model = Sequential()
model.add(GaussianNoise(1.0, input_shape=(x_train.shape[1],)))
model.add(Dense(512, activation='relu'))
model.add(Dense(512, activation='relu'))
model.add(Dense(10, use_bias=False, activation='tanh', name='svm'))

In [None]:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.0005,
                                    decay=0.0005/400,
                                    momentum=0.9)
model.compile(optimizer = optimizer,
              loss = svm_loss(model.get_layer('svm'), 0.5, 0.5),
              metrics = ['accuracy'])

In [None]:
batch_size = 256
epochs = 400

history = model.fit(x_train, y_train,
                    batch_size = batch_size,
                    epochs = epochs,
                    verbose = 1,
                    validation_data = (x_test, y_test))

In [None]:
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 10.02246379852295
Test accuracy: 0.9783999919891357


In [None]:
print('Best validation accuracy:', max(history.history['val_accuracy']))

Best validation accuracy: 0.9785000085830688


## Softmax Model

In [None]:
model = Sequential()
model.add(GaussianNoise(1.0, input_shape=(x_train.shape[1],)))
model.add(Dense(512, activation='relu'))
model.add(Dense(512, activation='relu'))
model.add(Dense(10, use_bias=False, activation='softmax', kernel_regularizer=l2(0.0001), name='softmax'))

In [None]:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.0005,
                                    decay=0.0005/400,
                                    momentum=0.9)
model.compile(optimizer = optimizer,
              loss = 'categorical_crossentropy',
              metrics = ['accuracy'])

In [None]:
batch_size = 256
epochs = 400

history = model.fit(x_train, y_train,
                    batch_size = batch_size,
                    epochs = epochs,
                    verbose = 1,
                    validation_data = (x_test, y_test))

In [None]:
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.07764852046966553
Test accuracy: 0.9782000184059143


In [None]:
print('Best validation accuracy:', max(history.history['val_accuracy']))

Best validation accuracy: 0.9785000085830688


# CIFAR

## Dataset

In [None]:
# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

In [None]:
num_classes = 10

# convert class vectors to binary class matrices
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)

In [None]:
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255.
x_test /= 255.
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

50000 train samples
10000 test samples


## Data Augmentation

In [None]:
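datagen = ImageDataGenerator(#zoom_range=0.1,
                             #width_shift_range=0.1,
                             #height_shift_range=0.1,
                             horizontal_flip=True)
# flow() batches the data itself (default batch size 32), so the minibatch
# size of 128 has to be set here.
it_train = datagen.flow(x_train, y_train, batch_size=128)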
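
For intuition, random horizontal reflection can be sketched directly in NumPy as follows. The function `random_horizontal_flip` is a hypothetical helper written for this illustration and is not how Keras implements the augmentation internally.

```python
import numpy as np

def random_horizontal_flip(batch, rng, p=0.5):
    """Flip each image of an (N, H, W, C) batch left-to-right with probability p."""
    flip_mask = rng.random(batch.shape[0]) < p
    out = batch.copy()
    # Reverse the width axis only for the selected images.
    out[flip_mask] = out[flip_mask, :, ::-1, :]
    return out, flip_mask

rng = np.random.default_rng(0)
batch = np.arange(2 * 2 * 3 * 1, dtype="float32").reshape(2, 2, 3, 1)
flipped, mask = random_horizontal_flip(batch, rng)
print(mask)
```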

## SVM Model

The original paper suggested a convolutional network with the following specification:


*   A convolutional network consisting of two convolutional layers
*   The first convolutional layer has 32 5 × 5 filters with ReLU activation
*   The second convolutional layer has 64 5 × 5 filters with ReLU activation
*   Both pooling layers use max pooling and downsample by a factor of 2
*   The penultimate layer has 3072 hidden nodes and uses ReLU activation with a dropout rate of 0.2
*   Horizontal reflection and jitter are applied to the data randomly
*   Minibatches of 128 data cases


But the problems of the paper are:


*   It didn’t specify the optimization algorithm, learning rate, etc.
*   It didn’t say anything about the number of epochs.
*   And most importantly, it didn’t specify the regularization weight value (𝜆).
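
The layer sizes implied by this specification can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions with `'same'` padding (as in the Keras model in this notebook), so only the pooling layers change the spatial size; the helper name is hypothetical.

```python
def conv_same_then_pool(height, width, filters, pool=2):
    """A 'same' convolution keeps H and W; max pooling divides each by `pool`."""
    return height // pool, width // pool, filters

h, w, c = 32, 32, 3                       # CIFAR-10 input
h, w, c = conv_same_then_pool(h, w, 32)   # -> 16 x 16 x 32
h, w, c = conv_same_then_pool(h, w, 64)   # -> 8 x 8 x 64

flat_units = h * w * c                    # inputs to the penultimate layer
dense_params = flat_units * 3072 + 3072   # weights plus biases
svm_params = 3072 * 10                    # final layer, no bias

print(flat_units, dense_params, svm_params)
```

Under these assumptions almost all of the parameters sit in the 3072-unit penultimate layer, which is also where the dropout is applied.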






In [None]:
model = Sequential()
model.add(Conv2D(32, (5, 5), input_shape=(32, 32, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (5, 5), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(3072, activation='relu', name='penultimate'))
model.add(Dropout(0.2))
model.add(Dense(10, use_bias=False, activation='tanh', name='svm'))

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
model.compile(optimizer = optimizer,
              loss = svm_loss(model.get_layer('svm'), 0.2, 0.8),
              metrics = ['accuracy'])

In [None]:
batch_size = 128
epochs = 50

history = model.fit(it_train,
                    batch_size = batch_size,
                    epochs = epochs,
                    verbose = 1,
                    validation_data = (x_test, y_test))


In [None]:
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 4.900073528289795
Test accuracy: 0.7702000141143799


In [None]:
print('Best validation accuracy:', max(history.history['val_accuracy']))

Best validation accuracy: 0.7730000019073486


## Softmax Model

In [None]:
model = Sequential()
model.add(Conv2D(32, (5, 5), input_shape=(32, 32, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (5, 5), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(3072, activation='relu', name='penultimate'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
model.compile(optimizer=optimizer,
              loss = 'categorical_crossentropy',
              metrics = ['accuracy'])

In [None]:
batch_size = 128
epochs = 50

history = model.fit(it_train,
                    batch_size = batch_size,
                    epochs = epochs,
                    verbose = 1,
                    validation_data = (x_test, y_test))


In [None]:
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 1.2834279537200928
Test accuracy: 0.7638000249862671


In [None]:
print('Best validation accuracy:', max(history.history['val_accuracy']))

Best validation accuracy: 0.7723000049591064
