# Final Project Report

##### Group Members:
Peter Klinkmueller, Archan Patel, Will Ye, Jason Zhang

## Introduction
For our project, we seek to compare the convergence rates of a number of descent algorithms for two different objectives, over multiple data sets, and utilizing three different learning rates.

`Will`

{ there was no where really else to put this TODO, so: pls comment the rest of the modules as JZ has comments so far for the descent_methods.py file - NOTE: you __don't__ have to do any commenting to util.py or for testing_mean.py - just the descent methods file, the learning rates file, and the models file - use the same formatting as JZ's existing comments, just for consistency, ty bb }

## Methods
In this section of our report, the different objectives, descent algorithms, and learning rates are discussed. 

* ### Objectives:
We implemented two different objectives for our project: logistic regression and a support vector machine.

   __Logistic Regression__

   `Will`

   { insert latex for the loss function for logistic regression here (the formulation that we use in the code) }

   { insert latex for the gradient function for logistic regression here (again, the one from the code) }

   `Archan`

   { insert a brief discussion of logistic regression here...why we used it, what makes it nice, idk anything else you wanna say }
 
   __Support Vector Machine__

   `Will`

   { insert latex for the loss function for SVM here (the formulation that we use in the code) }

   { insert latex for the gradient function for SVM here (again, the one from the code) }

   `Archan`

   { insert a brief discussion of SVM here, same deal as for with logistic regression }

* ### Descent Methods
We implemented four different descent methods. The batch size is configurable, which allows the gradient descent implementation to double as stochastic gradient descent.

   __Gradient Descent__
   This is our baseline descent method, with the implementation using the usual simple update rule.

   __Stochastic Gradient Descent__
   As noted above, we run stochastic gradient descent with the same GradientDescent object, simply by setting the batch size to be smaller than the number of training samples. We also run mini-batches, testing stochastic gradient descent for batch sizes $\{1, 10, 100\}$.
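
As a minimal sketch of the idea above, and not the project's actual `GradientDescent` class, the following shows how a single update routine can cover both full-batch and mini-batch descent purely through the batch size. The function name `minibatch_gd_step`, its arguments, and the least-squares gradient used for illustration are all assumptions.

```python
import numpy as np

def minibatch_gd_step(w, X, y, grad_fn, eta, batch_size, rng):
    """One descent step on a random batch (hypothetical helper).

    With batch_size == X.shape[0] this is a plain gradient descent step;
    with a smaller batch_size it is a (mini-batch) stochastic step.
    """
    n = X.shape[0]
    # sample without replacement; the full batch is just a permutation of all rows
    idx = rng.choice(n, size=min(batch_size, n), replace=False)
    return w - eta * grad_fn(w, X[idx], y[idx])

def least_squares_grad(w, X, y):
    # gradient of 0.5 * mean((Xw - y)^2), used only for illustration
    return X.T @ (X @ w - y) / X.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true  # noiseless targets, so both variants can recover w_true

w_full = np.zeros(3)
w_sgd = np.zeros(3)
for _ in range(2000):
    w_full = minibatch_gd_step(w_full, X, y, least_squares_grad, 0.1, 200, rng)
    w_sgd = minibatch_gd_step(w_sgd, X, y, least_squares_grad, 0.1, 10, rng)
```

Keeping the batch size as the only difference between the two runs mirrors the design choice described above: one update rule, with the stochastic behaviour coming entirely from how many samples feed each gradient.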

   __Nesterov's Accelerated Gradient Descent__

   `Archan/JZ - whoever actually wrote this/knows it - you guys figure it out`

   { insert a brief discussion of how this descent method works, and what we expect to get out of it compared to GD }

   __Stochastic Variance Reduced Gradient Descent__
  
   `Peter`

   { insert a brief discussion of how this descent method works, and what we expect to get out of it compared to GD }

   __Mirror Descent__

   `JZ`

   { insert a brief discussion of how this descent method works, and what we expect to get out of it compared to GD }

* ### Learning Rates
We explored the following learning rate schedules.
   __Fixed__ 
   The fixed learning rate simply sets and uses a constant $\eta$ for the step size throughout the model fit.
   __Square Root Decay__
   The square root decay we implemented is of the form:
   $$ \eta_i = \eta \cdot \frac{1}{\sqrt{i}}, $$
   where $\eta$ is a set parameter and $i$ is the iteration number.
   __Exponential Decay__
   The exponential decay we implemented is of the form:
   $$ \eta_i = \eta \cdot \frac{1}{\gamma \cdot i}, $$
   where $\eta$ and $\gamma$ are set parameters and $i$ is the iteration number.

   We wanted to see if this has a better or worse accelerating effect on the convergence of the algorithms compared to the square root decay scheme.
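
The schedules above can be sketched as plain functions of the iteration number. This is an illustrative sketch only: the function names are hypothetical and are not the project's `FixedRate`, `SqrtDecayRate`, or `ExpDecayRate` classes, and the iteration counter is assumed to start at $i = 1$ (both decaying forms divide by zero at $i = 0$). The third function follows the formula exactly as written above, which decays like $1/i$.

```python
import math

def fixed_rate(eta, i):
    # constant step size, independent of the iteration number
    return eta

def sqrt_decay_rate(eta, i):
    # eta_i = eta / sqrt(i), for iterations i = 1, 2, ...
    return eta / math.sqrt(i)

def inverse_decay_rate(eta, gamma, i):
    # eta_i = eta / (gamma * i), the second decaying form written above
    return eta / (gamma * i)

# compare the three schedules at a few iteration numbers
for i in (1, 4, 100):
    print(i, fixed_rate(0.01, i), sqrt_decay_rate(0.01, i),
          inverse_decay_rate(0.01, 2.0, i))
```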

* ### Relative Convergence Condition
In order to perform our analysis, we rely upon a relative convergence check performed on the loss, $L$, after each iteration of a descent algorithm. This check is of the form:
$$ \frac{| L_{i-1} - L_i |}{L_i} < rel\_conv $$
where $rel\_conv$ is our relative convergence condition, which we set as $rel\_conv = 0.000001$, and $L_i$ is our loss at iteration $i$.

   This allows us to run each model fit until it reaches this convergence condition, thus enabling a direct comparison of convergence rates by analyzing the relative runtime of each algorithm. This also means that all of our algorithms should converge to nearly the same loss, and thus nearly the same accuracy, for each objective function/data set pair.
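
A minimal sketch of this check is given here, assuming a hypothetical helper named `has_converged`. It is not the project's implementation. The guard for a zero loss and the use of `abs(loss)` in the denominator are assumptions added to keep the sketch well defined.

```python
def has_converged(prev_loss, loss, rel_conv=1e-6):
    """Relative change in loss between consecutive iterations (hypothetical helper)."""
    # guard against division by zero; treating a zero loss as converged is an assumption
    if loss == 0:
        return True
    return abs(prev_loss - loss) / abs(loss) < rel_conv

# toy loss history: the check only fires once consecutive losses are nearly equal
losses = [375.002, 225.254, 212.296, 212.2959]
flags = [has_converged(a, b) for a, b in zip(losses, losses[1:])]
```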

## Implementation
Here, we discuss how we went about implementing our methods.

* ### Overview
`Will`

{ insert some general discussion of our goals for implementing our methods... modularization...plug-n-play (know you love that)...etc. }

* ### Objectives
`Will`

{ insert brief discussion of modularization of objective objects }

* ### Descent Methods
`Will`

{ insert brief discussion of modularization of descent method objects }

* ### Learning Rates
`Will`

{ insert brief discussion of modularization of learning rate objects }

## Libraries

We utilize both external libraries, like numpy and scikit-learn, as well as internally written libraries for the sake of modularity and simplicity of code within this notebook. The goal for modularizing the code base is so that running the different algorithms here can be clean and require as few parameters and extraneous code blocks as possible, enabling us to focus on analysis.

In [13]:
# External libraries:
import numpy as np
from sklearn.model_selection import train_test_split

# Internal libraries:
import datasets.data as data
from descent_algorithms import *
from learning_rates import *
from models import *
from util import *

## Data
We use three different datasets for our analysis of our algorithms, all of which provide a binary classification problem.

`Will`

{ insert discussion of the data sets - focusing on why they meet the needs of our problem and what process you had to go through to prep them to be usable for this purpose }

Here, we read in the data vectors and labels using the datasets/data utility functions, and then perform a train/test split of 80%/20% of the provided samples. The splitting is done using the train_test_split function from the sklearn.model_selection package, which randomizes the splits.

In [2]:
features, labels = data.load_wisconsin_breast_cancer()
wbc_X_train, wbc_X_test, wbc_y_train, wbc_y_test = train_test_split(
    features, labels, test_size=0.2)
wbc_n = wbc_X_train.shape[0]

M_features, M_labels = data.load_MNIST_13()
mnist_X_train, mnist_X_test, mnist_y_train, mnist_y_test = train_test_split(
    M_features, M_labels, test_size = 0.2)
mnist_n = mnist_X_train.shape[0]

cod_features, cod_labels = data.load_cod_rna()
cod_X_train, cod_X_test, cod_y_train, cod_y_test = train_test_split(
    cod_features, cod_labels, test_size = 0.2)
cod_n = cod_X_train.shape[0]

# Logistic Regression Analysis

### Setting Relative Convergence Condition

In [3]:
# relative convergence limit
rel_conv = 0.000001

### Wisconsin Breast Cancer Data Set
We begin our analysis with a look at the performance of our three $LearningRate$ types on the WBC data set for all of our descent algorithms, which we will abbreviate for the remainder of the analysis as: Gradient Descent (GD), Stochastic Gradient Descent (SGD), Nesterov's Accelerated Gradient Descent (AGD), Stochastic Variance Reduced Gradient Descent (SVRG), and Mirror Descent (MD).

#### Fixed Learning Rate


In [4]:
# initialize our learning rate object
lr = FixedRate(0.01)

We begin by instantiating our descent method objects:

In [5]:
# instantiate our descent methods
gd = GradientDescent()
sgd_1 = GradientDescent() # the GD algorithm is used for all SGD algorithms, 
                          # with the smaller batch size specified in the model
sgd_10 = GradientDescent()
sgd_100 = GradientDescent()
agd = NesterovAcceleratedDescent()
svrg = StochasticVarianceReducedGradientDescent()
md = MirrorDescent()

Next, we initialize all of our model objects (all logistic regression models in this case), with the appropriate parameters for each algorithm.

In [6]:
# LogisticRegression(DescentAlgorithm, LearningRate, max iterations, batch size, relative convergence)
gd_log = LogisticRegression(gd, lr, 5000, wbc_n, rel_conv)
sgd_1_log = LogisticRegression(sgd_1, lr, 2000, 1, rel_conv)
sgd_10_log = LogisticRegression(sgd_10, lr, 4000, 10, rel_conv)
sgd_100_log = LogisticRegression(sgd_100, lr, 4000, 100, rel_conv)
agd_log = LogisticRegression(agd, lr, 400, wbc_n, rel_conv)
svrg_log = LogisticRegression(svrg, lr, 20, wbc_n, rel_conv)
md_log = LogisticRegression(md, lr, 2000, wbc_n, rel_conv)

Then, we run an initial fit for each model (the fits are repeated below so that results can be averaged across runs):

In [9]:
print('Fitting gradient descent:')
gd_loss, gd_time = gd_log.fit(wbc_X_train, wbc_y_train, True)
print('\nFitting stochastic gradient descent, batch size = 1:')
sgd_1_loss, sgd_1_time = sgd_1_log.fit(wbc_X_train, wbc_y_train, True)
print('\nFitting stochastic gradient descent, batch size = 10:')
sgd_10_loss, sgd_10_time = sgd_10_log.fit(wbc_X_train, wbc_y_train, True)
print('\nFitting stochastic gradient descent, batch size = 100:')
sgd_100_loss, sgd_100_time = sgd_100_log.fit(wbc_X_train, wbc_y_train, True)
print('\nFitting accelerated gradient descent:')
agd_loss, agd_time = agd_log.fit(wbc_X_train, wbc_y_train, True)
print('\nFitting stochastic variance reduced gradient descent:')
svrg_loss, svrg_time = svrg_log.fit(wbc_X_train, wbc_y_train, True)
# for i in range(0,wbc_svrg_loss.shape[0]*wbc_n):
#     if 
print('\nFitting mirror descent:')
md_loss, md_time = md_log.fit(wbc_X_train, wbc_y_train)

Fitting gradient descent:
Iter:        0 train loss: 375.002
Iter:      500 train loss: 225.254
Iter:     1000 train loss: 216.193
Iter:     1500 train loss: 213.715
Iter:     2000 train loss: 212.824
Iter:     2500 train loss: 212.458
Iter:     3000 train loss: 212.296
Converged in 3008 iterations.

Fitting stochastic gradient descent, batch size = 1:
Iter:        0 train loss: 407.684
Iter:      200 train loss: 274.196
Iter:      400 train loss: 249.848
Iter:      600 train loss: 222.718
Iter:      800 train loss: 220.777
Iter:     1000 train loss: 221.910
Iter:     1200 train loss: 277.182
Converged in 1266 iterations.

Fitting stochastic gradient descent, batch size = 10:
Iter:        0 train loss: 379.870
Iter:      400 train loss: 229.315
Iter:      800 train loss: 218.308
Iter:     1200 train loss: 215.550
Iter:     1600 train loss: 213.982
Iter:     2000 train loss: 212.983
Iter:     2400 train loss: 213.703
Iter:     2800 train loss: 212.829
Iter:     3200 train loss: 212.374


[... output truncated ...]

  return np.dot(-y.T, np.log(h)) - np.dot((1 - y).T,np.log(1 - h))

[... output truncated ...]


Iter:      160 train loss: nan
Iter:      200 train loss: nan
Iter:      240 train loss: nan
Iter:      280 train loss: nan
Iter:      320 train loss: nan
Iter:      360 train loss: nan

Fitting stochastic variance reduced gradient descent:
Iter:        0 train loss: 240.360
Iter:        2 train loss: 217.426
Iter:        4 train loss: 212.961
Iter:        6 train loss: 212.175
Iter:        8 train loss: 212.150
Iter:       10 train loss: 212.149
Converged in 10 iterations.

Fitting mirror descent:
Converged in 38 iterations.
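
The `nan` losses and the `np.log(h)` line in the output above are consistent with the predicted probabilities saturating at exactly 0 or 1, so that `np.log(0)` is evaluated. This is an inference from the output, not a confirmed diagnosis. The following sketches one common mitigation, clipping the probabilities before taking logs. It is not the project's implementation, and the function name `clipped_log_loss` and the `eps` value are assumptions.

```python
import numpy as np

def clipped_log_loss(h, y, eps=1e-12):
    # keep h strictly inside (0, 1) so that neither log evaluates log(0)
    h = np.clip(h, eps, 1.0 - eps)
    return -np.dot(y, np.log(h)) - np.dot(1.0 - y, np.log(1.0 - h))

y = np.array([1.0, 0.0, 1.0])
h = np.array([1.0, 0.0, 0.5])   # saturated predictions that are nevertheless correct
loss = clipped_log_loss(h, y)   # finite; without clipping, 0 * log(0) gives nan
```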


In [8]:
acc = check_accuracy(gd_log, wbc_X_test, wbc_y_test)
print("GD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(sgd_1_log, wbc_X_test, wbc_y_test)
print("SGD 1 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(sgd_10_log, wbc_X_test, wbc_y_test)
print("SGD 10 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(sgd_100_log, wbc_X_test, wbc_y_test)
print("SGD 100 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(agd_log, wbc_X_test, wbc_y_test)
print("AGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(svrg_log, wbc_X_test, wbc_y_test)
print("SVRG Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(md_log, wbc_X_test, wbc_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))

GD Accuracy: 84.29%
SGD 1 Accuracy: 87.14%
SGD 10 Accuracy: 84.29%
SGD 100 Accuracy: 85.71%
AGD Accuracy: 85.00%
SVRG Accuracy: 84.29%
MD Accuracy: 84.29%


Then, we re-run this fitting for each of the algorithms 99 more times (100 runs in total) in order to better smooth our expected loss results:

In [11]:
gd_loss_counts = np.sign(gd_loss)
sgd_1_loss_counts = np.sign(sgd_1_loss)
sgd_10_loss_counts = np.sign(sgd_10_loss)
sgd_100_loss_counts = np.sign(sgd_100_loss)
agd_loss_counts = np.sign(agd_loss)
svrg_loss_counts = np.sign(svrg_loss)
md_loss_counts = np.sign(md_loss)

for i in range(0,99):
    gd = GradientDescent()
    sgd_1 = GradientDescent()
    sgd_10 = GradientDescent()
    sgd_100 = GradientDescent()
    agd = NesterovAcceleratedDescent()
    svrg = StochasticVarianceReducedGradientDescent()
    md = MirrorDescent()
    gd_log = LogisticRegression(gd, lr, 5000, wbc_n, rel_conv)
    sgd_1_log = LogisticRegression(sgd_1, lr, 2000, 1, rel_conv)
    sgd_10_log = LogisticRegression(sgd_10, lr, 4000, 10, rel_conv)
    sgd_100_log = LogisticRegression(sgd_100, lr, 4000, 100, rel_conv)
    agd_log = LogisticRegression(agd, lr, 400, wbc_n, rel_conv)
    svrg_log = LogisticRegression(svrg, lr, 20, wbc_n, rel_conv)
    md_log = LogisticRegression(md, lr, 2000, wbc_n, rel_conv)
    
    tmp_loss, tmp_time = gd_log.fit(wbc_X_train, wbc_y_train)
    gd_loss_counts += np.sign(tmp_loss)
    gd_loss += tmp_loss
    gd_time += tmp_time
    tmp_loss, tmp_time = sgd_1_log.fit(wbc_X_train, wbc_y_train)
    sgd_1_loss_counts += np.sign(tmp_loss)
    sgd_1_loss += tmp_loss
    sgd_1_time += tmp_time
    tmp_loss, tmp_time = sgd_10_log.fit(wbc_X_train, wbc_y_train)
    sgd_10_loss_counts += np.sign(tmp_loss)
    sgd_10_loss += tmp_loss
    sgd_10_time += tmp_time
    tmp_loss, tmp_time = sgd_100_log.fit(wbc_X_train, wbc_y_train)
    sgd_100_loss_counts += np.sign(tmp_loss)
    sgd_100_loss += tmp_loss
    sgd_100_time += tmp_time
    tmp_loss, tmp_time = agd_log.fit(wbc_X_train, wbc_y_train)
    agd_loss_counts += np.sign(tmp_loss)
    agd_loss += tmp_loss
    agd_time += tmp_time
    tmp_loss, tmp_time = svrg_log.fit(wbc_X_train, wbc_y_train)
    svrg_loss_counts += np.sign(tmp_loss)
    svrg_loss += tmp_loss
    svrg_time += tmp_time
    tmp_loss, tmp_time = md_log.fit(wbc_X_train, wbc_y_train)
    md_loss_counts += np.sign(tmp_loss)
    md_loss += tmp_loss
    md_time += tmp_time
gd_loss /= gd_loss_counts
sgd_1_loss /= sgd_1_loss_counts
sgd_10_loss /= sgd_10_loss_counts
sgd_100_loss /= sgd_100_loss_counts
agd_loss /= agd_loss_counts
svrg_loss /= svrg_loss_counts
md_loss /= md_loss_counts
gd_time /= 100
sgd_1_time /= 100
sgd_10_time /= 100
sgd_100_time /= 100
agd_time /= 100
svrg_time /= 100
md_time /= 100

Converged in 3008 iterations.
Converged in 650 iterations.
Converged in 2489 iterations.
Converged in 131 iterations.
Converged in 8 iterations.
Converged in 1665 iterations.
Converged in 3008 iterations.
Converged in 896 iterations.
Converged in 2617 iterations.
Converged in 2087 iterations.
Converged in 131 iterations.
Converged in 8 iterations.
Converged in 1665 iterations.
Converged in 3008 iterations.
Converged in 1283 iterations.
Converged in 1398 iterations.
Converged in 131 iterations.
Converged in 10 iterations.
Converged in 1665 iterations.
Converged in 3008 iterations.
Converged in 834 iterations.
Converged in 2690 iterations.
Converged in 467 iterations.
Converged in 131 iterations.
Converged in 8 iterations.
Converged in 1665 iterations.
Converged in 3008 iterations.
Converged in 727 iterations.
Converged in 393 iterations.
Converged in 131 iterations.
Converged in 8 iterations.
Converged in 1665 iterations.
Converged in 3008 iterations.
Converged in 1054 iterations.
[... output truncated ...]

Converged in 861 iterations.
Converged in 131 iterations.
Converged in 7 iterations.
Converged in 1665 iterations.
Converged in 3008 iterations.
Converged in 890 iterations.
Converged in 659 iterations.
Converged in 131 iterations.
Converged in 5 iterations.
Converged in 1665 iterations.
Converged in 3008 iterations.
Converged in 1080 iterations.
Converged in 1562 iterations.
Converged in 202 iterations.
Converged in 131 iterations.
Converged in 9 iterations.
Converged in 1665 iterations.
Converged in 3008 iterations.
Converged in 784 iterations.
Converged in 835 iterations.
Converged in 131 iterations.
Converged in 7 iterations.
Converged in 1665 iterations.
Converged in 3008 iterations.
Converged in 1320 iterations.
Converged in 547 iterations.
Converged in 675 iterations.
Converged in 131 iterations.
Converged in 7 iterations.
Converged in 1665 iterations.
Converged in 3008 iterations.
Converged in 814 iterations.
Converged in 131 iterations.
Converged in 8 iterations.
[... output truncated ...]

Converged in 1237 iterations.
Converged in 2427 iterations.
Converged in 592 iterations.
Converged in 131 iterations.
Converged in 8 iterations.
Converged in 1665 iterations.
Converged in 3008 iterations.
Converged in 736 iterations.
Converged in 963 iterations.
Converged in 2579 iterations.
Converged in 131 iterations.
Converged in 8 iterations.
Converged in 1665 iterations.
Converged in 3008 iterations.
Converged in 674 iterations.
Converged in 573 iterations.
Converged in 131 iterations.
Converged in 10 iterations.
Converged in 1665 iterations.
Converged in 3008 iterations.
Converged in 734 iterations.
Converged in 2744 iterations.
Converged in 954 iterations.
Converged in 131 iterations.
Converged in 7 iterations.
Converged in 1665 iterations.
Converged in 3008 iterations.
Converged in 1409 iterations.
Converged in 523 iterations.
Converged in 215 iterations.
Converged in 131 iterations.
Converged in 8 iterations.
Converged in 1665 iterations.
Converged in 3008 iterations.
[... output truncated ...]

In [12]:
gd_loss = gd_loss[gd_loss.nonzero()]
sgd_1_loss = sgd_1_loss[sgd_1_loss.nonzero()]
sgd_10_loss = sgd_10_loss[sgd_10_loss.nonzero()]
sgd_100_loss = sgd_100_loss[sgd_100_loss.nonzero()]
agd_loss = agd_loss[agd_loss.nonzero()]
svrg_loss = svrg_loss[svrg_loss.nonzero()]
md_loss = md_loss[md_loss.nonzero()]
print(gd_time)
print(sgd_1_time)
print(sgd_10_time)
print(sgd_100_time)
print(agd_time)
print(svrg_time)
print(md_time)

3.89495224237442
0.5146340107917786
0.5572686982154846
0.30148998737335203
0.12963653802871705
0.8682884550094605
1.2997781491279603
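
The run-averaging above appears to rely on each loss history being zero-padded to a fixed length, so that `np.sign` counts, per iteration, how many runs were still active. A self-contained sketch of that idea on toy data follows. The helper name `average_padded_losses` and the toy arrays are assumptions, and the sketch presumes strictly positive losses, as the `np.sign` trick does.

```python
import numpy as np

def average_padded_losses(runs):
    """Average loss histories that are zero-padded to a common length."""
    runs = np.asarray(runs, dtype=float)
    totals = runs.sum(axis=0)
    # number of runs that recorded a (non-zero) loss at each iteration
    counts = np.count_nonzero(runs, axis=0)
    mean = np.zeros_like(totals)
    # divide only where at least one run contributed, avoiding 0/0
    np.divide(totals, counts, out=mean, where=counts > 0)
    # drop iterations that no run reached
    return mean[counts > 0]

runs = [
    [4.0, 2.0, 1.0, 0.0],   # converged after 3 iterations
    [6.0, 4.0, 0.0, 0.0],   # converged after 2 iterations
    [5.0, 3.0, 2.0, 1.5],   # ran for all 4 iterations
]
avg = average_padded_losses(runs)
```

In NumPy, 0/0 yields `nan`, and `nan` counts as non-zero, so guarding the division keeps `nan` entries from surviving a later `nonzero()` filter.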


Then, we can plot the run-averaged losses for each algorithm:

In [None]:
plot_fixed_losses(gd_loss, sgd_1_loss, sgd_10_loss, sgd_100_loss, agd_loss, svrg_loss, md_loss)

In [None]:
# initialize our learning rate object
lr_gd = ExpDecayRate(0.1, 0.0001)
lr_sgd = ExpDecayRate(0.01, 0.00001)
lr_md = ExpDecayRate(0.1, 0.00001)
# initialize our descent methods
gd = GradientDescent()
sgd_1 = GradientDescent()
md = MirrorDescent()
# initialize logistic regression models
gd_log = LogisticRegression(gd, lr_gd, 2000, wbc_n, rel_conv)
sgd_1_log = LogisticRegression(sgd_1, lr_sgd, 2000, 1, rel_conv)
md_log = LogisticRegression(md, lr_md, 2000, wbc_n, rel_conv)
# fit the models (fit returns the loss history and the run time; only the loss is kept here)
print('Fitting gradient descent:')
wbc_gd_loss, _ = gd_log.fit(wbc_X_train, wbc_y_train)
print('\nFitting stochastic gradient descent, batch size = 1:')
wbc_sgd_1_loss, _ = sgd_1_log.fit(wbc_X_train, wbc_y_train)
print('\nFitting mirror descent:')
wbc_md_loss, _ = md_log.fit(wbc_X_train, wbc_y_train)
# print the test accuracies for each model
acc = check_accuracy(gd_log, wbc_X_test, wbc_y_test)
print("\n\nGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(sgd_1_log, wbc_X_test, wbc_y_test)
print("SGD 1 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(md_log, wbc_X_test, wbc_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))
# plot the loss convergences for each model
plot_dynamic_losses(wbc_gd_loss, wbc_sgd_1_loss, wbc_md_loss)

In [None]:
# initialize our learning rate object
# lr_gd = ExpDecayRate(0.1, 0.001)
# lr_sgd = ExpDecayRate(0.01, 0.001)
# lr_md = ExpDecayRate(0.1, 0.001)
lr_gd = SqrtDecayRate(0.001,1)
lr_sgd = SqrtDecayRate(0.0001,1)
lr_md = SqrtDecayRate(0.001,1)
# initialize our descent methods
gd = GradientDescent()
sgd_1 = GradientDescent()
md = MirrorDescent()
# initialize logistic regression models
gd_log = LogisticRegression(gd, lr_gd, 2000, wbc_n, rel_conv)
sgd_1_log = LogisticRegression(sgd_1, lr_sgd, 2000, 1, rel_conv)
md_log = LogisticRegression(md, lr_md, 2000, wbc_n, rel_conv)
# fit the models (fit returns the loss history and the run time; only the loss is kept here)
print('Fitting gradient descent:')
wbc_gd_loss, _ = gd_log.fit(wbc_X_train, wbc_y_train)
print('\nFitting stochastic gradient descent, batch size = 1:')
wbc_sgd_1_loss, _ = sgd_1_log.fit(wbc_X_train, wbc_y_train)
print('\nFitting mirror descent:')
wbc_md_loss, _ = md_log.fit(wbc_X_train, wbc_y_train)
# print the test accuracies for each model
acc = check_accuracy(gd_log, wbc_X_test, wbc_y_test)
print("\n\nGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(sgd_1_log, wbc_X_test, wbc_y_test)
print("SGD 1 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(md_log, wbc_X_test, wbc_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))
# plot the loss convergences for each model
plot_dynamic_losses(wbc_gd_loss, wbc_sgd_1_loss, wbc_md_loss)

### MNIST Data
Then, we set up the run for all of our descent methods on the MNIST dataset, beginning with the initialization of each of our descent method objects. We combine the cells here to reduce the footprint, as the usage is the same as above.

In [None]:
lr = FixedRate(0.000001)
# initialize our descent methods
gd = GradientDescent()
sgd_1 = GradientDescent() 
sgd_10 = GradientDescent()
sgd_100 = GradientDescent()
agd = NesterovAcceleratedDescent()
svrg = StochasticVarianceReducedGradientDescent()
md = MirrorDescent()
# initialize the logistic regression objects
gd_log = LogisticRegression(gd, lr, 5000, mnist_n, rel_conv)
sgd_1_log = LogisticRegression(sgd_1, lr, 5000, 1, rel_conv)
sgd_10_log = LogisticRegression(sgd_10, lr, 5000, 10, rel_conv)
sgd_100_log = LogisticRegression(sgd_100, lr, 5000, 100, rel_conv)
agd_log = LogisticRegression(agd, lr, 2000, mnist_n, rel_conv)
svrg_log = LogisticRegression(svrg, lr, 40, mnist_n, rel_conv)
md_log = LogisticRegression(md, lr, 3000, mnist_n, rel_conv)
# and run the fit for each of these models, this time on the MNIST data set
# (fit returns the loss history and the run time; only the loss is kept here):
print('Fitting gradient descent:')
mnist_gd_loss, _ = gd_log.fit(mnist_X_train, mnist_y_train)
print('\nFitting stochastic gradient descent, batch size = 1:')
mnist_sgd_1_loss, _ = sgd_1_log.fit(mnist_X_train, mnist_y_train)
print('\nFitting stochastic gradient descent, batch size = 10:')
mnist_sgd_10_loss, _ = sgd_10_log.fit(mnist_X_train, mnist_y_train)
print('\nFitting stochastic gradient descent, batch size = 100:')
mnist_sgd_100_loss, _ = sgd_100_log.fit(mnist_X_train, mnist_y_train)
print('\nFitting accelerated gradient descent:')
mnist_agd_loss, _ = agd_log.fit(mnist_X_train, mnist_y_train)
print('\nFitting stochastic variance reduced gradient descent:')
mnist_svrg_loss, _ = svrg_log.fit(mnist_X_train, mnist_y_train)
print('\nFitting mirror descent:')
mnist_md_loss, _ = md_log.fit(mnist_X_train, mnist_y_train)
# displaying accuracies
acc = check_accuracy(gd_log, mnist_X_test, mnist_y_test)
print("\n\nGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(sgd_1_log, mnist_X_test, mnist_y_test)
print("SGD 1 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(sgd_10_log, mnist_X_test, mnist_y_test)
print("SGD 10 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(sgd_100_log, mnist_X_test, mnist_y_test)
print("SGD 100 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(agd_log, mnist_X_test, mnist_y_test)
print("AGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(svrg_log, mnist_X_test, mnist_y_test)
print("SVRG Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(md_log, mnist_X_test, mnist_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))
# plot losses
plot_losses(mnist_gd_loss, mnist_sgd_1_loss, mnist_sgd_10_loss, 
            mnist_sgd_100_loss, mnist_agd_loss, mnist_svrg_loss, 
            mnist_md_loss)

In [None]:
# initialize our learning rate object
lr_gd = PolyDecayRate(0.01, 0.0001)
lr_sgd = PolyDecayRate(0.01, 0.00001)
lr_md = PolyDecayRate(0.1, 0.00001)
# initialize our descent methods
gd = GradientDescent()
sgd_1 = GradientDescent()
md = MirrorDescent()
# initialize logistic regression models
gd_log = LogisticRegression(gd, lr_gd, 2000, mnist_n, rel_conv)
sgd_1_log = LogisticRegression(sgd_1, lr_sgd, 2000, 1, rel_conv)
md_log = LogisticRegression(md, lr_md, 2000, mnist_n, rel_conv)
# fit the models (fit returns the loss history and the run time; only the loss is kept here)
print('Fitting gradient descent:')
mnist_gd_loss, _ = gd_log.fit(mnist_X_train, mnist_y_train)
print('\nFitting stochastic gradient descent, batch size = 1:')
mnist_sgd_1_loss, _ = sgd_1_log.fit(mnist_X_train, mnist_y_train)
print('\nFitting mirror descent:')
mnist_md_loss, _ = md_log.fit(mnist_X_train, mnist_y_train)
# print the test accuracies for each model
acc = check_accuracy(gd_log, mnist_X_test, mnist_y_test)
print("\n\nGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(sgd_1_log, mnist_X_test, mnist_y_test)
print("SGD 1 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(md_log, mnist_X_test, mnist_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))
# plot the loss convergences for each model
plot_dynamic_losses(mnist_gd_loss, mnist_sgd_1_loss, mnist_md_loss)

### COD-RNA Data
Lastly, we set up the run for all of our descent methods on the COD-RNA dataset, again using a reduced-frill cell to run our fit for each method's model.

In [None]:
lr = FixedRate(0.00001)
# initialize our descent methods
gd = GradientDescent()
sgd_1 = GradientDescent()
sgd_10 = GradientDescent()
sgd_100 = GradientDescent()
agd = NesterovAcceleratedDescent()
svrg = StochasticVarianceReducedGradientDescent()
md = MirrorDescent()
# initialize the logistic regression objects
gd_log = LogisticRegression(gd, lr, 5000, cod_n, rel_conv)
sgd_1_log = LogisticRegression(sgd_1, lr, 4000, 1, rel_conv)
sgd_10_log = LogisticRegression(sgd_10, lr, 4000, 10, rel_conv)
sgd_100_log = LogisticRegression(sgd_100, lr, 4000, 100, rel_conv)
agd_log = LogisticRegression(agd, lr, 200, cod_n, rel_conv)
svrg_log = LogisticRegression(svrg, lr, 20, cod_n, rel_conv)
md_log = LogisticRegression(md, lr, 2000, cod_n, rel_conv)
# and run the fit for each of these models, this time on the COD-RNA data set
# (fit returns the loss history and the run time; only the loss is kept here):
print('Fitting gradient descent:')
cod_gd_loss, _ = gd_log.fit(cod_X_train, cod_y_train)
print('\nFitting stochastic gradient descent, batch size = 1:')
cod_sgd_1_loss, _ = sgd_1_log.fit(cod_X_train, cod_y_train)
print('\nFitting stochastic gradient descent, batch size = 10:')
cod_sgd_10_loss, _ = sgd_10_log.fit(cod_X_train, cod_y_train)
print('\nFitting stochastic gradient descent, batch size = 100:')
cod_sgd_100_loss, _ = sgd_100_log.fit(cod_X_train, cod_y_train)
print('\nFitting accelerated gradient descent:')
cod_agd_loss, _ = agd_log.fit(cod_X_train, cod_y_train)
print('\nFitting stochastic variance reduced gradient descent:')
cod_svrg_loss, _ = svrg_log.fit(cod_X_train, cod_y_train)
print('\nFitting mirror descent:')
cod_md_loss, _ = md_log.fit(cod_X_train, cod_y_train)
# displaying accuracies
acc = check_accuracy(gd_log, cod_X_test, cod_y_test)
print("\n\nGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(sgd_1_log, cod_X_test, cod_y_test)
print("SGD 1 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(sgd_10_log, cod_X_test, cod_y_test)
print("SGD 10 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(sgd_100_log, cod_X_test, cod_y_test)
print("SGD 100 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(agd_log, cod_X_test, cod_y_test)
print("AGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(svrg_log, cod_X_test, cod_y_test)
print("SVRG Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(md_log, cod_X_test, cod_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))
# plot losses
plot_fixed_losses(cod_gd_loss, cod_sgd_1_loss, cod_sgd_10_loss, 
            cod_sgd_100_loss, cod_agd_loss, cod_svrg_loss, 
            cod_md_loss)

In [None]:
# initialize our learning rate object
lr_gd = PolyDecayRate(0.0001, 0.0001)
lr_sgd = PolyDecayRate(0.00001, 0.00001)
lr_md = PolyDecayRate(0.0001, 0.00001)
# initialize our descent methods
gd = GradientDescent()
sgd_1 = GradientDescent()
md = MirrorDescent()
# initialize logistic regression models
gd_log = LogisticRegression(gd, lr_gd, 1000, cod_n, rel_conv)
sgd_1_log = LogisticRegression(sgd_1, lr_sgd, 1500, 1, rel_conv)
md_log = LogisticRegression(md, lr_md, 1000, cod_n, rel_conv)
# fit the models (fit returns the loss history and the run time; only the loss is kept here)
print('Fitting gradient descent:')
cod_gd_loss, _ = gd_log.fit(cod_X_train, cod_y_train)
print('\nFitting stochastic gradient descent, batch size = 1:')
cod_sgd_1_loss, _ = sgd_1_log.fit(cod_X_train, cod_y_train)
print('\nFitting mirror descent:')
cod_md_loss, _ = md_log.fit(cod_X_train, cod_y_train)
# print the test accuracies for each model
acc = check_accuracy(gd_log, cod_X_test, cod_y_test)
print("\n\nGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(sgd_1_log, cod_X_test, cod_y_test)
print("SGD 1 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(md_log, cod_X_test, cod_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))
# plot the loss convergences for each model
plot_dynamic_losses(cod_gd_loss, cod_sgd_1_loss, cod_md_loss)

In [None]:
# initialize our learning rate object
lr_gd = SqrtDecayRate(0.0001, 1.)
lr_sgd = SqrtDecayRate(0.00001, 1.)
lr_md = SqrtDecayRate(0.0001, 1.)
# initialize our descent methods
gd = GradientDescent()
sgd_1 = GradientDescent()
md = MirrorDescent()
# initialize logistic regression models
gd_log = LogisticRegression(gd, lr_gd, 2000, cod_n, rel_conv)
sgd_1_log = LogisticRegression(sgd_1, lr_sgd, 4000, 1, rel_conv)
md_log = LogisticRegression(md, lr_md, 4000, cod_n, rel_conv)
# fit the models (fit returns the loss history and the run time; only the loss is kept here)
print('Fitting gradient descent:')
cod_gd_loss, _ = gd_log.fit(cod_X_train, cod_y_train)
print('\nFitting stochastic gradient descent, batch size = 1:')
cod_sgd_1_loss, _ = sgd_1_log.fit(cod_X_train, cod_y_train)
print('\nFitting mirror descent:')
cod_md_loss, _ = md_log.fit(cod_X_train, cod_y_train)
# print the test accuracies for each model
acc = check_accuracy(gd_log, cod_X_test, cod_y_test)
print("\n\nGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(sgd_1_log, cod_X_test, cod_y_test)
print("SGD 1 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy(md_log, cod_X_test, cod_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))
# plot the loss convergences for each model
plot_dynamic_losses(cod_gd_loss, cod_sgd_1_loss, cod_md_loss)

### Exponential Decaying Learning Rate
Next, we look at the convergence of our GD, SGD, AGD, and SVRG algorithms on our three datasets under an exponentially decaying learning rate.

The default initial learning rate is set to 0.01, with a gamma value of 0.0001.

#### Wisconsin Breast Cancer Data
Then, we set up the run for all of our descent methods on the Wisconsin Breast Cancer dataset, beginning with the initialization of each of our descent method objects.

#### COD-RNA Data
Lastly, we set up the run for all of our descent methods on the COD-RNA dataset, again using a reduced-frill cell to run our fit for each method's model.

### Square-Root Decaying Learning Rate
Lastly, we analyze the convergence of our GD, SGD, AGD, and SVRG algorithms on our three datasets under a square-root decaying learning rate.

The default initial learning rate is set to 0.01, with a gamma value of 0.0001.

#### Wisconsin Breast Cancer Data
Then, we set up the run for all of our descent methods on the Wisconsin Breast Cancer dataset, beginning with the initialization of each of our descent method objects.

Run notes:

FIRST:
lr:
    - fixed: GD(0.01), SGD(0.01) - batched at 1,10,100 , SVRG(0.01), Nest(0.01)
    - polydecay: GD(0.01, 0.0001), SGD(0.01, 0.00001) - batched, SVRG(N/A), Nest(N/A)
    - expdecay: GD(0.1,0.001), SGD(0.1,.001) - batched, SVRG(N/A), Nest(N/A)

# SVM Analysis

Now, we will carry out our analysis of the SVM objective.

We will perform largely the same analysis as with our logistic regression tests; however, we will use a mini-batch SGD algorithm for the dynamic, decaying learning rates instead of the single-sample SGD algorithm.
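
The only difference between single-sample and mini-batch SGD is how many rows are drawn for each gradient estimate. The following is a minimal sketch of one such step, not the implementation in `descent_methods.py`. The names `minibatch_step`, `grad_fn`, and `lsq_grad` are hypothetical, and a toy least-squares gradient stands in for the SVM gradient.

```python
import numpy as np


def minibatch_step(w, X, y, grad_fn, lr, batch_size, rng):
    """One SGD step on a random mini-batch (batch_size=1 is single-sample SGD)."""
    n = X.shape[0]
    # sample row indices without replacement; cap at n so a full batch is allowed
    idx = rng.choice(n, size=min(batch_size, n), replace=False)
    return w - lr * grad_fn(w, X[idx], y[idx])


def lsq_grad(w, Xb, yb):
    """Toy least-squares gradient, used only to exercise the step."""
    return Xb.T @ (Xb @ w - yb) / Xb.shape[0]


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
for _ in range(2000):
    w = minibatch_step(w, X, y, lsq_grad, lr=0.05, batch_size=100, rng=rng)
print(w)
```

Setting `batch_size` to the number of training rows recovers full-batch gradient descent, which matches how a single descent object can serve both roles.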

Again, a relative convergence of $rel\_conv = 0.000001$ will be used in order to compare the relative convergence rates of the different descent methods.
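
A relative convergence test of this kind can be sketched as follows. This assumes the criterion compares the change in loss between consecutive iterations against the previous loss; the function name `has_converged` is hypothetical, and the check inside the model classes may be defined differently.

```python
def has_converged(prev_loss, curr_loss, rel_conv=1e-6):
    """Assumed criterion: relative change in loss falls below rel_conv."""
    denom = max(abs(prev_loss), 1e-12)  # guard against division by zero
    return abs(prev_loss - curr_loss) / denom < rel_conv


print(has_converged(1.0, 0.9))          # large relative change
print(has_converged(1.0, 1.0 - 1e-8))   # tiny relative change
```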

The value of the hyperparameter $c$ was found to be $0.00001$. This parameter is ...
## ARCHAN 
{ pls insert a brief explanation for what the heck this 'c' is / what is does }

In [None]:
c = 0.00001

Before we can begin running our analysis, however, we need to convert our data sets' labels to take values in $\{-1,1\}$ rather than $\{0,1\}$, as they are currently configured. This is accomplished by running the train and test split label vectors through a simple utility function we have written for this purpose.
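
The signed encoding matters because the hinge loss depends on the product of the label and the model's score. A minimal sketch of such a conversion is shown below; `to_signed_labels` is a hypothetical stand-in, not the `zero_one_labels_to_signed` utility from `util.py`, and it assumes the input contains only 0s and 1s.

```python
import numpy as np


def to_signed_labels(y):
    """Map labels in {0, 1} to {-1, +1} (hypothetical stand-in for the utility)."""
    y = np.asarray(y)
    return 2 * y - 1


print(to_signed_labels([0, 1, 1, 0]))
```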

In [None]:
# label conversion for the Wisconsin Breast Cancer data set
wbc_y_train = zero_one_labels_to_signed(wbc_y_train)
wbc_y_test = zero_one_labels_to_signed(wbc_y_test)

# label conversion for the MNIST binarized data set
mnist_y_train = zero_one_labels_to_signed(mnist_y_train)
mnist_y_test = zero_one_labels_to_signed(mnist_y_test)

# label conversion for the COD-RNA data set
cod_y_train = zero_one_labels_to_signed(cod_y_train)
cod_y_test = zero_one_labels_to_signed(cod_y_test)

As with Logistic Regression, we begin with observing the performance of the SVM objective across the different algorithms for the different learning rate paradigms investigated, and then across the three data sets. The configuration of the executed cells is the same as with the Logistic Regression, but with SVM model objects instantiated in place of the LogisticRegression ones.
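
For orientation, a textbook L2-regularized hinge loss, one of its subgradients, and the corresponding sign-based prediction rule are sketched below. This is a standard soft-margin formulation and not necessarily the one implemented in the `SVM` class. The names `hinge_loss`, `hinge_subgradient`, `predict`, and `reg` are assumptions, and the way `c` enters the objective in the code may differ.

```python
import numpy as np


def hinge_loss(w, X, y, reg):
    """Mean hinge loss plus (reg / 2) * ||w||^2; labels y must be in {-1, +1}."""
    margins = y * (X @ w)
    return np.mean(np.maximum(0.0, 1.0 - margins)) + 0.5 * reg * (w @ w)


def hinge_subgradient(w, X, y, reg):
    """A subgradient of hinge_loss with respect to w."""
    margins = y * (X @ w)
    active = margins < 1.0  # points inside the margin or misclassified
    grad = -(X[active].T @ y[active]) / X.shape[0]
    return grad + reg * w


def predict(w, X):
    """Predict signed labels from the sign of the linear score."""
    return np.where(X @ w >= 0.0, 1, -1)


# tiny example: two points on opposite sides of the origin
X_toy = np.array([[1.0, 0.0], [-1.0, 0.0]])
y_toy = np.array([1, -1])
w_toy = np.zeros(2)
print(hinge_loss(w_toy, X_toy, y_toy, reg=0.0))
print(hinge_subgradient(w_toy, X_toy, y_toy, reg=0.0))
```

Because the hinge loss is not differentiable where the margin equals one, the methods compared here would be operating on a subgradient rather than a true gradient under this formulation.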

### Wisconsin Breast Cancer Data Set
We begin our SVM analysis by observing the algorithms' performance on the WBC data set. 

##### Fixed Learning Rate

In [None]:
# instantiate our learning rate object
lr = FixedRate(0.001)

# instantiate our descent methods
gd = GradientDescent()
sgd_100 = GradientDescent()
agd = NesterovAcceleratedDescent()
svrg = StochasticVarianceReducedGradientDescent()
md = MirrorDescent()

# instantiate all of the SVM model objects
gd_svm = SVM(gd, lr, c, 20000, wbc_n, rel_conv)
sgd_100_svm = SVM(sgd_100, lr, c, 20000, 100, rel_conv)
agd_svm = SVM(agd, lr, c, 20000, wbc_n, rel_conv)
svrg_svm = SVM(svrg, lr, c, 3000, wbc_n, rel_conv)
md_svm = SVM(md, lr, c, 2000, wbc_n, rel_conv)

# run fitting for all of the models
print('Fitting gradient descent:')
wbc_gd_loss = gd_svm.fit(wbc_X_train, wbc_y_train)
print('\nFitting stochastic gradient descent, batch size = 100:')
wbc_sgd_100_loss = sgd_100_svm.fit(wbc_X_train, wbc_y_train)
print('\nFitting accelerated gradient descent:')
wbc_agd_loss = agd_svm.fit(wbc_X_train, wbc_y_train)
print('\nFitting stochastic variance reduced gradient descent:')
wbc_svrg_loss = svrg_svm.fit(wbc_X_train, wbc_y_train)
print('\nFitting mirror descent:')
wbc_md_loss = md_svm.fit(wbc_X_train, wbc_y_train)

# print test accuracies
acc = check_accuracy_svm(gd_svm, wbc_X_test, wbc_y_test)
print("GD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(sgd_100_svm, wbc_X_test, wbc_y_test)
print("SGD 100 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(agd_svm, wbc_X_test, wbc_y_test)
print("AGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(svrg_svm, wbc_X_test, wbc_y_test)
print("SVRG Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(md_svm, wbc_X_test, wbc_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))

# plot our losses
plot_fixed_svm_losses(wbc_gd_loss, wbc_sgd_100_loss, wbc_agd_loss, wbc_svrg_loss, wbc_md_loss)

##### Polynomial Decay Learning Rate

In [None]:
# initialize our learning rate object
lr_gd = PolyDecayRate(0.01, 0.0001)
lr_sgd = PolyDecayRate(0.01, 0.00001)
lr_md = PolyDecayRate(0.001, 0.00001)
# initialize our descent methods
gd = GradientDescent()
sgd_100 = GradientDescent()
md = MirrorDescent()
# initialize SVM models
gd_log = SVM(gd, lr_gd, c, 1000, wbc_n, rel_conv)
sgd_100_log = SVM(sgd_100, lr_sgd, c, 2000, 100, rel_conv)
md_log = SVM(md, lr_md, c, 1000, wbc_n, rel_conv)
# fit the models...
print('Fitting gradient descent:')
wbc_gd_loss = gd_log.fit(wbc_X_train, wbc_y_train)
print('\nFitting stochastic gradient descent, batch size = 100:')
wbc_sgd_100_loss = sgd_100_log.fit(wbc_X_train, wbc_y_train)
print('\nFitting mirror descent:')
wbc_md_loss = md_log.fit(wbc_X_train, wbc_y_train)
# print the test accuracies for each model
acc = check_accuracy_svm(gd_log, wbc_X_test, wbc_y_test)
print("\n\nGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(sgd_100_log, wbc_X_test, wbc_y_test)
print("SGD 100 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(md_log, wbc_X_test, wbc_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))
# plot the loss convergences for each model
plot_dynamic_svm_losses(wbc_gd_loss, wbc_sgd_100_loss, wbc_md_loss)

##### Exponential Decay Learning Rate

In [None]:
# initialize our learning rate object
lr_gd = ExpDecayRate(0.01, 0.001)
lr_sgd = ExpDecayRate(0.01, 0.0001)
lr_md = ExpDecayRate(0.001, 0.001)

# initialize our descent methods
gd = GradientDescent()
sgd_100 = GradientDescent()
md = MirrorDescent()

# initialize SVM models
gd_log = SVM(gd, lr_gd, c, 2000, wbc_n, rel_conv)
sgd_100_log = SVM(sgd_100, lr_sgd, c, 6000, 100, rel_conv)
md_log = SVM(md, lr_md, c, 2000, wbc_n, rel_conv)

# fit the models...
print('Fitting gradient descent:')
wbc_gd_loss = gd_log.fit(wbc_X_train, wbc_y_train)
print('\nFitting stochastic gradient descent, batch size = 100:')
wbc_sgd_100_loss = sgd_100_log.fit(wbc_X_train, wbc_y_train)
print('\nFitting mirror descent:')
wbc_md_loss = md_log.fit(wbc_X_train, wbc_y_train)

# print the test accuracies for each model
acc = check_accuracy_svm(gd_log, wbc_X_test, wbc_y_test)
print("\n\nGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(sgd_100_log, wbc_X_test, wbc_y_test)
print("SGD 100 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(md_log, wbc_X_test, wbc_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))

# plot the loss convergences for each model
plot_dynamic_svm_losses(wbc_gd_loss, wbc_sgd_100_loss, wbc_md_loss)

We can see for the WBC data that our algorithms...
# Peter 
{ insert analysis for previous three plots and results here...
  
  also insert a chart for the timing and iterations to convergence }
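
The timing and iteration counts for such a chart could be gathered with a small wrapper like the one below. This is only a sketch: `timed_fit` is a hypothetical helper, and reading the iteration count from the returned losses assumes that `fit` returns one loss value per iteration.

```python
import time


def timed_fit(fit_fn, *args):
    """Run fit_fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fit_fn(*args)
    return result, time.perf_counter() - start


# stand-in for a model's fit method, used only to demonstrate the wrapper
def fake_fit(n_iters):
    return [1.0 / (i + 1) for i in range(n_iters)]


losses, seconds = timed_fit(fake_fit, 1000)
print(len(losses), seconds)
```

In the notebook this would be called as, for example, `losses, seconds = timed_fit(gd_svm.fit, wbc_X_train, wbc_y_train)`.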

### Binarized MNIST Data Set
We now investigate the performance of the same algorithms and learning rates on the MNIST data set.

##### Fixed Learning Rate

In [None]:
# initialize our learning rate object
lr = FixedRate(0.001)

# initialize our descent method objects
gd = GradientDescent()
sgd_100 = GradientDescent()
agd = NesterovAcceleratedDescent()
svrg = StochasticVarianceReducedGradientDescent()
md = MirrorDescent()

# initialize all of the SVM models
# (assumes mnist_n, the MNIST training-set size, is defined analogously to wbc_n)
gd_svm = SVM(gd, lr, c, 20000, mnist_n, rel_conv)
sgd_100_svm = SVM(sgd_100, lr, c, 20000, 100, rel_conv)
agd_svm = SVM(agd, lr, c, 20000, mnist_n, rel_conv)
svrg_svm = SVM(svrg, lr, c, 3000, mnist_n, rel_conv)
md_svm = SVM(md, lr, c, 2000, mnist_n, rel_conv)

# run fitting for all of the models
# (assumes mnist_X_train and mnist_X_test exist alongside the mnist_y_* label vectors)
print('Fitting gradient descent:')
mnist_gd_loss = gd_svm.fit(mnist_X_train, mnist_y_train)
print('\nFitting stochastic gradient descent, batch size = 100:')
mnist_sgd_100_loss = sgd_100_svm.fit(mnist_X_train, mnist_y_train)
print('\nFitting accelerated gradient descent:')
mnist_agd_loss = agd_svm.fit(mnist_X_train, mnist_y_train)
print('\nFitting stochastic variance reduced gradient descent:')
mnist_svrg_loss = svrg_svm.fit(mnist_X_train, mnist_y_train)
print('\nFitting mirror descent:')
mnist_md_loss = md_svm.fit(mnist_X_train, mnist_y_train)

# print test accuracies
acc = check_accuracy_svm(gd_svm, mnist_X_test, mnist_y_test)
print("GD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(sgd_100_svm, mnist_X_test, mnist_y_test)
print("SGD 100 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(agd_svm, mnist_X_test, mnist_y_test)
print("AGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(svrg_svm, mnist_X_test, mnist_y_test)
print("SVRG Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(md_svm, mnist_X_test, mnist_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))

# plot our losses
plot_fixed_svm_losses(mnist_gd_loss, mnist_sgd_100_loss, mnist_agd_loss, mnist_svrg_loss, mnist_md_loss)

##### Polynomial Decay Learning Rate

In [None]:
# initialize our learning rate object
lr_gd = PolyDecayRate(0.01, 0.0001)
lr_sgd = PolyDecayRate(0.01, 0.00001)
lr_md = PolyDecayRate(0.001, 0.00001)
# initialize our descent methods
gd = GradientDescent()
sgd_100 = GradientDescent()
md = MirrorDescent()
# initialize SVM models
# (assumes mnist_n, mnist_X_train, and mnist_X_test are defined as for the WBC data)
gd_log = SVM(gd, lr_gd, c, 1000, mnist_n, rel_conv)
sgd_100_log = SVM(sgd_100, lr_sgd, c, 2000, 100, rel_conv)
md_log = SVM(md, lr_md, c, 1000, mnist_n, rel_conv)
# fit the models...
print('Fitting gradient descent:')
mnist_gd_loss = gd_log.fit(mnist_X_train, mnist_y_train)
print('\nFitting stochastic gradient descent, batch size = 100:')
mnist_sgd_100_loss = sgd_100_log.fit(mnist_X_train, mnist_y_train)
print('\nFitting mirror descent:')
mnist_md_loss = md_log.fit(mnist_X_train, mnist_y_train)
# print the test accuracies for each model
acc = check_accuracy_svm(gd_log, mnist_X_test, mnist_y_test)
print("\n\nGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(sgd_100_log, mnist_X_test, mnist_y_test)
print("SGD 100 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(md_log, mnist_X_test, mnist_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))
# plot the loss convergences for each model
plot_dynamic_svm_losses(mnist_gd_loss, mnist_sgd_100_loss, mnist_md_loss)

##### Exponential Decay Learning Rate

In [None]:
# initialize our learning rate object
lr_gd = ExpDecayRate(0.01, 0.001)
lr_sgd = ExpDecayRate(0.01, 0.0001)
lr_md = ExpDecayRate(0.001, 0.001)
# initialize our descent methods
gd = GradientDescent()
sgd_100 = GradientDescent()
md = MirrorDescent()
# initialize SVM models
# (assumes mnist_n, mnist_X_train, and mnist_X_test are defined as for the WBC data)
gd_log = SVM(gd, lr_gd, c, 2000, mnist_n, rel_conv)
sgd_100_log = SVM(sgd_100, lr_sgd, c, 6000, 100, rel_conv)
md_log = SVM(md, lr_md, c, 2000, mnist_n, rel_conv)
# fit the models...
print('Fitting gradient descent:')
mnist_gd_loss = gd_log.fit(mnist_X_train, mnist_y_train)
print('\nFitting stochastic gradient descent, batch size = 100:')
mnist_sgd_100_loss = sgd_100_log.fit(mnist_X_train, mnist_y_train)
print('\nFitting mirror descent:')
mnist_md_loss = md_log.fit(mnist_X_train, mnist_y_train)
# print the test accuracies for each model
acc = check_accuracy_svm(gd_log, mnist_X_test, mnist_y_test)
print("\n\nGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(sgd_100_log, mnist_X_test, mnist_y_test)
print("SGD 100 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(md_log, mnist_X_test, mnist_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))
# plot the loss convergences for each model
plot_dynamic_svm_losses(mnist_gd_loss, mnist_sgd_100_loss, mnist_md_loss)

We can see for the MNIST data that our algorithms...
# Peter 
{ insert analysis for previous three plots and results here...
  
  also insert a chart for the timing and iterations to convergence }

### COD-RNA Data Set
Finally, we investigate the performance of the same algorithms and learning rates on the COD-RNA data set.

##### Fixed Learning Rate

In [None]:
# initialize our learning rate object
lr = FixedRate(0.001)

# initialize our descent methods
gd = GradientDescent()
sgd_100 = GradientDescent()
agd = NesterovAcceleratedDescent()
svrg = StochasticVarianceReducedGradientDescent()
md = MirrorDescent()
# initialize all of the SVM models
# (assumes cod_n, the COD-RNA training-set size, is defined analogously to wbc_n)
gd_svm = SVM(gd, lr, c, 20000, cod_n, rel_conv)
sgd_100_svm = SVM(sgd_100, lr, c, 20000, 100, rel_conv)
agd_svm = SVM(agd, lr, c, 20000, cod_n, rel_conv)
svrg_svm = SVM(svrg, lr, c, 3000, cod_n, rel_conv)
md_svm = SVM(md, lr, c, 2000, cod_n, rel_conv)

# run fitting for all of the models
print('Fitting gradient descent:')
cod_gd_loss = gd_svm.fit(cod_X_train, cod_y_train)
print('\nFitting stochastic gradient descent, batch size = 100:')
cod_sgd_100_loss = sgd_100_svm.fit(cod_X_train, cod_y_train)
print('\nFitting accelerated gradient descent:')
cod_agd_loss = agd_svm.fit(cod_X_train, cod_y_train)
print('\nFitting stochastic variance reduced gradient descent:')
cod_svrg_loss = svrg_svm.fit(cod_X_train, cod_y_train)
print('\nFitting mirror descent:')
cod_md_loss = md_svm.fit(cod_X_train, cod_y_train)

# print test accuracies
acc = check_accuracy_svm(gd_svm, cod_X_test, cod_y_test)
print("GD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(sgd_100_svm, cod_X_test, cod_y_test)
print("SGD 100 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(agd_svm, cod_X_test, cod_y_test)
print("AGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(svrg_svm, cod_X_test, cod_y_test)
print("SVRG Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(md_svm, cod_X_test, cod_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))

# plot our losses
plot_fixed_svm_losses(cod_gd_loss, cod_sgd_100_loss, cod_agd_loss, cod_svrg_loss, cod_md_loss)

##### Polynomial Decay Learning Rate

In [None]:
# initialize our learning rate object
lr_gd = PolyDecayRate(0.01, 0.0001)
lr_sgd = PolyDecayRate(0.01, 0.00001)
lr_md = PolyDecayRate(0.001, 0.00001)
# initialize our descent methods
gd = GradientDescent()
sgd_100 = GradientDescent()
md = MirrorDescent()
# initialize SVM models (assumes cod_n is defined analogously to wbc_n)
gd_log = SVM(gd, lr_gd, c, 1000, cod_n, rel_conv)
sgd_100_log = SVM(sgd_100, lr_sgd, c, 2000, 100, rel_conv)
md_log = SVM(md, lr_md, c, 1000, cod_n, rel_conv)
# fit the models...
print('Fitting gradient descent:')
cod_gd_loss = gd_log.fit(cod_X_train, cod_y_train)
print('\nFitting stochastic gradient descent, batch size = 100:')
cod_sgd_100_loss = sgd_100_log.fit(cod_X_train, cod_y_train)
print('\nFitting mirror descent:')
cod_md_loss = md_log.fit(cod_X_train, cod_y_train)
# print the test accuracies for each model
acc = check_accuracy_svm(gd_log, cod_X_test, cod_y_test)
print("\n\nGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(sgd_100_log, cod_X_test, cod_y_test)
print("SGD 100 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(md_log, cod_X_test, cod_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))
# plot the loss convergences for each model
plot_dynamic_svm_losses(cod_gd_loss, cod_sgd_100_loss, cod_md_loss)

##### Exponential Decay Learning Rate

In [None]:
# initialize our learning rate object
lr_gd = ExpDecayRate(0.01, 0.001)
lr_sgd = ExpDecayRate(0.01, 0.0001)
lr_md = ExpDecayRate(0.001, 0.001)
# initialize our descent methods
gd = GradientDescent()
sgd_100 = GradientDescent()
md = MirrorDescent()
# initialize SVM models (assumes cod_n is defined analogously to wbc_n)
gd_log = SVM(gd, lr_gd, c, 2000, cod_n, rel_conv)
sgd_100_log = SVM(sgd_100, lr_sgd, c, 6000, 100, rel_conv)
md_log = SVM(md, lr_md, c, 2000, cod_n, rel_conv)
# fit the models...
print('Fitting gradient descent:')
cod_gd_loss = gd_log.fit(cod_X_train, cod_y_train)
print('\nFitting stochastic gradient descent, batch size = 100:')
cod_sgd_100_loss = sgd_100_log.fit(cod_X_train, cod_y_train)
print('\nFitting mirror descent:')
cod_md_loss = md_log.fit(cod_X_train, cod_y_train)
# print the test accuracies for each model
acc = check_accuracy_svm(gd_log, cod_X_test, cod_y_test)
print("\n\nGD Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(sgd_100_log, cod_X_test, cod_y_test)
print("SGD 100 Accuracy: {0:.2f}%".format(acc * 100))
acc = check_accuracy_svm(md_log, cod_X_test, cod_y_test)
print("MD Accuracy: {0:.2f}%".format(acc * 100))
# plot the loss convergences for each model
plot_dynamic_svm_losses(cod_gd_loss, cod_sgd_100_loss, cod_md_loss)

We can see for the COD-RNA data that our algorithms...
# Peter 
{ insert analysis for previous three plots and results here...
  
  also insert a chart for the timing and iterations to convergence }