# Optimizing Neural Networks

As a reminder, the ingredients needed to train a neural network are:
* model
* objective
* optimizer
    
Today we will cover the basics of optimizing neural networks, giving context for the last two lectures. The goals are to:
* Understand the basics of generalization, and the difference between optimization and generalization (more on that in the "Understanding generalization" lab)
* Understand the impact of SGD hyperparameters on:

  - generalization (learning rate, batch size)
  - speed of optimization (learning rate, momentum, batch size)

References:
* Deep Learning book chapter on optimization: http://www.deeplearningbook.org/contents/optimization.html

# Setup

In [1]:
# Boilerplate code to get started

%load_ext autoreload
%autoreload 
%matplotlib inline

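import json
import matplotlib as mpl
import matplotlib.pyplot as plt
import torch
from torch import optim
from src import fmnist_utils
from src.fmnist_utils import *  # star import supplies helpers such as build_mlp and train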

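def plot(H):
    """Plot train and test accuracy curves from a training history dict."""
    plt.title(max(H['test_acc']))  # title shows the best test accuracy reached
    plt.plot(H['acc'], label="acc")
    plt.plot(H['test_acc'], label="test_acc")
    plt.legend()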

mpl.rcParams['lines.linewidth'] = 2
mpl.rcParams['figure.figsize'] = (7, 7)
mpl.rcParams['axes.titlesize'] = 12
mpl.rcParams['axes.labelsize'] = 12

(x_train, y_train), (x_test, y_test) = fmnist_utils.get_data()



# Whiteboard exercises

(1 point for each)

* Give a case in which layers in a neural network might learn at different speeds. Do layers in neural networks usually learn at different speeds?
* Write the update expression for RMSProp. Interpret the equation.
* Compare weight decay and L2 regularization. Explain the difference.
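
As a reference point for the RMSProp question, here is a minimal sketch of one common form of the update, written in plain Python on lists of floats. The function name `rmsprop_step` and the default values of `lr`, `rho` and `eps` are assumptions chosen for illustration; library implementations add further options.

```python
import math

def rmsprop_step(w, grad, sq_avg, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update (illustrative sketch, not the lab's code)."""
    new_w, new_sq_avg = [], []
    for w_i, g_i, s_i in zip(w, grad, sq_avg):
        # Exponential moving average of the squared gradient.
        s_i = rho * s_i + (1.0 - rho) * g_i * g_i
        # Step is divided by the root-mean-square of recent gradients.
        w_i = w_i - lr * g_i / (math.sqrt(s_i) + eps)
        new_w.append(w_i)
        new_sq_avg.append(s_i)
    return new_w, new_sq_avg

# Toy quadratic f(w) = 0.5 * (100 * w0^2 + w1^2): gradients differ by 100x.
w, sq_avg = [1.0, 1.0], [0.0, 0.0]
for _ in range(50):
    grad = [100.0 * w[0], 1.0 * w[1]]
    w, sq_avg = rmsprop_step(w, grad, sq_avg)
print(w)
```

Because each coordinate is normalized by its own gradient scale, both coordinates of `w` shrink at nearly the same rate here, even though their gradients differ by a factor of 100.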

# Exercise 1: optimization speed

For a fixed number of *epochs*, it is usually better for optimization to use either a smaller batch size or a larger learning rate (within the range where training remains stable). The theoretical reason is not completely clear, so in this exercise we focus on an empirical investigation.

Assume you are allowed to train the given network for 10 epochs. Here $\eta$ is the learning rate, $S$ the batch size, and $\mu$ the momentum. Answer the following questions:

* a) Which $\eta$ (assuming $S$=128 and $\mu$=0.9) gives the best final training accuracy?
* b) Did it also give the best test accuracy? If yes, why (hint: consider whether the model is under- or overfitting)?
* c) Which $S$ (assuming $\eta$=0.1 and $\mu$=0.9) gives the best final training accuracy?
* d) Why does a higher learning rate, or a smaller batch size, optimize faster? Give your best explanation (it can be a hypothesis; there is no obvious theoretical answer).
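
To make the roles of $\eta$, $\mu$ and $S$ concrete, the sketch below shows the momentum update in the form used by `torch.optim.SGD` with default dampening, together with a count of parameter updates for a fixed epoch budget. The helper names and the dataset size of 1000 are assumptions for illustration, and the count assumes the last partial batch of each epoch is kept.

```python
import math

def sgd_momentum_step(w, v, grad, lr, mu):
    """Heavy-ball update: v <- mu * v + grad, then w <- w - lr * v."""
    v = [mu * v_i + g_i for v_i, g_i in zip(v, grad)]
    w = [w_i - lr * v_i for w_i, v_i in zip(w, v)]
    return w, v

def updates_per_run(n_examples, batch_size, n_epochs):
    """Number of parameter updates when every epoch visits all examples once."""
    return math.ceil(n_examples / batch_size) * n_epochs

# One step from w = 1 with zero initial velocity and gradient 2.
w, v = sgd_momentum_step([1.0], [0.0], [2.0], lr=0.1, mu=0.9)

# Hypothetical dataset of 1000 examples trained for 10 epochs.
few_updates = updates_per_run(1000, batch_size=128, n_epochs=10)
many_updates = updates_per_run(1000, batch_size=16, n_epochs=10)
print(w, v, few_updates, many_updates)
```

With the epoch budget fixed, a smaller batch size yields more updates, and a larger learning rate makes each update bigger; either way the parameters can travel further in the same number of epochs.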

In [2]:
# Sweep the learning rate at fixed batch size (128) and momentum (0.9).
for lr in [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5]:
    model = build_mlp(784, 10, hidden_dims=[512])  # fresh model for every run
    loss = torch.nn.CrossEntropyLoss(reduction="mean")  # loss averaged over the mini-batch
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    H = train(loss=loss, model=model, x_train=x_train, y_train=y_train,
              x_test=x_test, y_test=y_test,
              optim=optimizer, batch_size=128, n_epochs=10)
    # Report the accuracies after the final epoch.
    print("lr: ", lr, " train_acc: ", H['acc'][-1], " test_acc: ", H['test_acc'][-1])

lr:  0.0001  train_acc:  0.146  test_acc:  0.147
lr:  0.0005  train_acc:  0.477  test_acc:  0.482
lr:  0.001  train_acc:  0.585  test_acc:  0.562
lr:  0.005  train_acc:  0.679  test_acc:  0.659
lr:  0.01  train_acc:  0.746  test_acc:  0.712
lr:  0.05  train_acc:  0.87  test_acc:  0.775
lr:  0.1  train_acc:  0.869  test_acc:  0.782
lr:  0.5  train_acc:  0.671  test_acc:  0.595





In [3]:
# Sweep the batch size at fixed learning rate (0.1) and momentum (0.9).
for bs in [2, 4, 8, 16, 32, 64, 128, 256, 512]:
    model = build_mlp(784, 10, hidden_dims=[512])  # fresh model for every run
    loss = torch.nn.CrossEntropyLoss(reduction="mean")  # loss averaged over the mini-batch
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    H = train(loss=loss, model=model, x_train=x_train, y_train=y_train,
              x_test=x_test, y_test=y_test,
              optim=optimizer, batch_size=bs, n_epochs=10)
    # Report the accuracies after the final epoch.
    print("bs: ", bs, " train_acc: ", H['acc'][-1], " test_acc: ", H['test_acc'][-1])

bs:  2  train_acc:  0.102  test_acc:  0.095
bs:  4  train_acc:  0.115  test_acc:  0.095
bs:  8  train_acc:  0.213  test_acc:  0.188
bs:  16  train_acc:  0.69  test_acc:  0.591
bs:  32  train_acc:  0.819  test_acc:  0.715
bs:  64  train_acc:  0.912  test_acc:  0.785
bs:  128  train_acc:  0.849  test_acc:  0.758
bs:  256  train_acc:  0.782  test_acc:  0.708
bs:  512  train_acc:  0.706  test_acc:  0.682





In [4]:
answers = {
    "a": "0.1",
    "b": "Yes. Might be underfitting?",
    "c": "128",
    "d": "Higher lr / lower batchsize is more noisy. Maybe it allows you to break out of flat areas / past saddle-points faster.",
}
# Use a context manager so the file is flushed and closed.
with open("7b_ex1.json", "w") as f:
    json.dump(answers, f)

# Exercise 2: generalization

The story for generalization is also unclear, but it is generally accepted that higher noise levels in SGD lead to better generalization. Think of noise in optimization (leading to low fidelity, as seen in lab 7a, for instance) as a close analog of typical regularizers (like dropout or batch normalization, which we will discuss next time).

Your task is to:

a) Check a range of LR and BS values and find the combination that generalizes best. What test accuracy were you able to achieve, and with which LR and BS?

b) Answer the following question: is stability correlated with using a large LR or a small BS? If yes, what is the intuitive reason? Feel free to give a hypothesis.

Hints:

* Make sure each run reaches 100% training accuracy; discard hyperparameters that do not converge.

Notes:

* Do not change the model in the starting code. It is deliberately a slightly more complex MLP.

* You can measure stability by computing the margin. This is implemented for you (using the DeepFool method, https://arxiv.org/abs/1511.04599). Measuring the margin is expensive, so the recommended approach is to compute it only for a few final runs.
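
The hint about discarding non-converged runs can be sketched as a small selection helper. The function name `best_converged_run`, the record layout, and the accuracy values below are made up for illustration; they are not results from this lab.

```python
def best_converged_run(runs, min_train_acc=1.0):
    """Return the run with the highest test accuracy among converged runs."""
    converged = [r for r in runs if r["train_acc"] >= min_train_acc]
    if not converged:
        return None  # nothing reached the required training accuracy
    return max(converged, key=lambda r: r["test_acc"])

# Toy records (hypothetical values, not measured in this notebook).
runs = [
    {"lr": 0.01, "bs": 16, "train_acc": 1.0, "test_acc": 0.77},
    {"lr": 0.05, "bs": 32, "train_acc": 1.0, "test_acc": 0.79},
    {"lr": 0.001, "bs": 16, "train_acc": 0.92, "test_acc": 0.80},
]
best = best_converged_run(runs)
print(best)
```

The third toy run has the highest test accuracy but is excluded because it never fits the training set, so comparing it with the others would mix optimization effects with generalization effects.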

In [5]:
# Grid search over learning rate and batch size, 20 epochs per run.
for lr in [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5]:
    for bs in [2, 4, 8, 16, 32, 64, 128, 256, 512]:
        model = build_mlp(784, 10, hidden_dims=[512])  # fresh model for every run
        loss = torch.nn.CrossEntropyLoss(reduction="mean")  # loss averaged over the mini-batch
        optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        H = train(loss=loss, model=model, x_train=x_train, y_train=y_train,
                  x_test=x_test, y_test=y_test,
                  optim=optimizer, batch_size=bs, n_epochs=20)
        # Report the accuracies after the final epoch.
        print("lr: ", lr, " bs: ", bs, " train_acc: ", H['acc'][-1], " test_acc: ", H['test_acc'][-1])

lr:  0.0001  bs:  2  train_acc:  0.797  test_acc:  0.742
lr:  0.0001  bs:  4  train_acc:  0.741  test_acc:  0.701
lr:  0.0001  bs:  8  train_acc:  0.688  test_acc:  0.667
lr:  0.0001  bs:  16  train_acc:  0.644  test_acc:  0.633
lr:  0.0001  bs:  32  train_acc:  0.585  test_acc:  0.567
lr:  0.0001  bs:  64  train_acc:  0.471  test_acc:  0.475
lr:  0.0001  bs:  128  train_acc:  0.124  test_acc:  0.128
lr:  0.0001  bs:  256  train_acc:  0.103  test_acc:  0.122
lr:  0.0001  bs:  512  train_acc:  0.104  test_acc:  0.125
lr:  0.0005  bs:  2  train_acc:  0.924  test_acc:  0.802
lr:  0.0005  bs:  4  train_acc:  0.855  test_acc:  0.769
lr:  0.0005  bs:  8  train_acc:  0.821  test_acc:  0.767
lr:  0.0005  bs:  16  train_acc:  0.762  test_acc:  0.72
lr:  0.0005  bs:  32  train_acc:  0.698  test_acc:  0.674
lr:  0.0005  bs:  64  train_acc:  0.653  test_acc:  0.632
lr:  0.0005  bs:  128  train_acc:  0.583  test_acc:  0.568
lr:  0.0005  bs:  256  train_acc:  0.428  test_acc:  0.421
lr:  0.0005  bs:  512  train_acc:  0.103  test_acc:  0.122
lr:  0.001  bs:  2  train_acc:  0.921  test_acc:  0.778
lr:  0.001  bs:  4  train_acc:  0.917  test_acc:  0.784
lr:  0.001  bs:  8  train_acc:  0.881  test_acc:  0.793
lr:  0.001  bs:  16  train_acc:  0.82  test_acc:  0.764
lr:  0.001  bs:  32  train_acc:  0.764  test_acc:  0.722
lr:  0.001  bs:  64  train_acc:  0.685  test_acc:  0.658
lr:  0.001  bs:  128  train_acc:  0.656  test_acc:  0.637
lr:  0.001  bs:  256  train_acc:  0.535  test_acc:  0.513
lr:  0.001  bs:  512  train_acc:  0.113  test_acc:  0.127
lr:  0.005  bs:  2  train_acc:  0.903  test_acc:  0.744
lr:  0.005  bs:  4  train_acc:  0.922  test_acc:  0.76
lr:  0.005  bs:  8  train_acc:  0.963  test_acc:  0.798
lr:  0.005  bs:  16  train_acc:  0.934  test_acc:  0.79
lr:  0.005  bs:  32  train_acc:  0.895  test_acc:  0.782
lr:  0.005  bs:  64  train_acc:  0.818  test_acc:  0.76
lr:  0.005  bs:  128  train_acc:  0.763  test_acc:  0.726
lr:  0.005  bs:  256  train_acc:  0.68  test_acc:  0.664
lr:  0.005  bs:  512  train_acc:  0.566  test_acc:  0.533
lr:  0.01  bs:  2  train_acc:  0.789  test_acc:  0.688
lr:  0.01  bs:  4  train_acc:  0.906  test_acc:  0.763
lr:  0.01  bs:  8  train_acc:  0.959  test_acc:  0.791
lr:  0.01  bs:  16  train_acc:  0.971  test_acc:  0.802
lr:  0.01  bs:  32  train_acc:  0.911  test_acc:  0.791
lr:  0.01  bs:  64  train_acc:  0.878  test_acc:  0.79
lr:  0.01  bs:  128  train_acc:  0.815  test_acc:  0.764
lr:  0.01  bs:  256  train_acc:  0.732  test_acc:  0.7
lr:  0.01  bs:  512  train_acc:  0.647  test_acc:  0.623
lr:  0.05  bs:  2  train_acc:  0.107  test_acc:  0.108
lr:  0.05  bs:  4  train_acc:  0.145  test_acc:  0.134
lr:  0.05  bs:  8  train_acc:  0.655  test_acc:  0.565
lr:  0.05  bs:  16  train_acc:  0.869  test_acc:  0.755
lr:  0.05  bs:  32  train_acc:  0.93  test_acc:  0.799
lr:  0.05  bs:  64  train_acc:  0.951  test_acc:  0.808
lr:  0.05  bs:  128  train_acc:  0.93  test_acc:  0.791
lr:  0.05  bs:  256  train_acc:  0.823  test_acc:  0.725
lr:  0.05  bs:  512  train_acc:  0.705  test_acc:  0.669
lr:  0.1  bs:  2  train_acc:  0.102  test_acc:  0.096
lr:  0.1  bs:  4  train_acc:  0.104  test_acc:  0.105
lr:  0.1  bs:  8  train_acc:  0.107  test_acc:  0.107
lr:  0.1  bs:  16  train_acc:  0.547  test_acc:  0.493
lr:  0.1  bs:  32  train_acc:  0.864  test_acc:  0.751
lr:  0.1  bs:  64  train_acc:  0.906  test_acc:  0.79
lr:  0.1  bs:  128  train_acc:  0.913  test_acc:  0.79
lr:  0.1  bs:  256  train_acc:  0.906  test_acc:  0.785
lr:  0.1  bs:  512  train_acc:  0.724  test_acc:  0.691
lr:  0.5  bs:  2  train_acc:  0.115  test_acc:  0.095
lr:  0.5  bs:  4  train_acc:  0.099  test_acc:  0.095
lr:  0.5  bs:  8  train_acc:  0.1  test_acc:  0.087
lr:  0.5  bs:  16  train_acc:  0.107  test_acc:  0.107
lr:  0.5  bs:  32  train_acc:  0.104  test_acc:  0.105
lr:  0.5  bs:  64  train_acc:  0.411  test_acc:  0.378
lr:  0.5  bs:  128  train_acc:  0.717  test_acc:  0.648
lr:  0.5  bs:  256  train_acc:  0.805  test_acc:  0.704
lr:  0.5  bs:  512  train_acc:  0.696  test_acc:  0.661





In [6]:
answers = {
    "a": "Training till 100% training accuracy, the combination with the highest test accuracy was lr: 0.05, bs: 128 (with test acc of 0.793).",
    "b": "It does seem like it is (though with lr mattering more). If higher lr regularizes more, it makes sense that stability is higher as well.",
    "c": "There is no c?",
}
# Use a context manager so the file is flushed and closed.
with open("7b_ex2.json", "w") as f:
    json.dump(answers, f)

## Stability measure

In lab 7a we discussed the bias/variance view. Here we take a stability-based view. To estimate stability,
we record the maximum change in prediction when Gaussian noise is added to examples. This is a very rudimentary
way to estimate the geometric margin of the network; we will talk more about this later.
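
The noise-based estimate described above can be sketched as follows. This is not the `measure_stability_deepfool` helper imported in the next cell; the function name `max_prediction_change`, its parameters, and the toy models are assumptions for illustration, with a model represented as a plain function from a list of inputs to a list of outputs.

```python
import random

def max_prediction_change(predict, x, sigma=0.1, n_trials=100, seed=0):
    """Largest change in any output when Gaussian noise is added to x."""
    rng = random.Random(seed)  # fixed seed so repeated calls see the same noise
    base = predict(x)
    worst = 0.0
    for _ in range(n_trials):
        noisy = [x_i + rng.gauss(0.0, sigma) for x_i in x]
        out = predict(noisy)
        change = max(abs(o - b) for o, b in zip(out, base))
        worst = max(worst, change)
    return worst

# Toy "models": a constant one, and two linear ones with different slopes.
x = [0.2, -0.4, 1.0]
flat = max_prediction_change(lambda z: [0.5, 0.5], x)
gentle = max_prediction_change(lambda z: [0.1 * sum(z)], x)
steep = max_prediction_change(lambda z: [10.0 * sum(z)], x)
print(flat, gentle, steep)
```

A smaller value means the prediction moves less under input noise, which is the sense in which a model is called more stable here.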

In [7]:
from src.deepfool import measure_stability_deepfool

## Finding optimal $\eta$ and $S$

In [8]:
## Starting code

Hs = []
Lrs = [0.001, 0.005, 0.01, 0.05]
Margins = []
bss = [16, 32, 64, 128]

for lr in Lrs:
    for bs in bss:
        model = build_mlp(784, 10, hidden_dims=[100, 100, 100])
        loss = torch.nn.CrossEntropyLoss(reduction="mean")  # loss averaged over the mini-batch
        optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        # NOTE: batch_size is fixed at 100 here, so `bs` only labels the run;
        # pass batch_size=bs to actually sweep the batch size.
        H = train(loss=loss, model=model, x_train=x_train, y_train=y_train,
                  x_test=x_test, y_test=y_test,
                  optim=optimizer, batch_size=100, n_epochs=400)
        # DeepFool-based stability estimate for the trained model.
        Margins.append(measure_stability_deepfool(model=model, 
                    x_train=x_train, y_train=y_train, loss=loss, N=1000))
        Hs.append(H)



In [9]:
i = 0
for lr in Lrs:
    for bs in bss:
        print("lr: ", lr, " bs: ", bs, " train_acc: ", Hs[i]['acc'][-1], " test_acc: ", Hs[i]['test_acc'][-1])
        print(Margins[i])
        i += 1

lr:  0.001  bs:  16  train_acc:  0.918  test_acc:  0.77
0.6724442
lr:  0.001  bs:  32  train_acc:  0.916  test_acc:  0.769
0.66856253
lr:  0.001  bs:  64  train_acc:  0.912  test_acc:  0.767
0.6687987
lr:  0.001  bs:  128  train_acc:  0.903  test_acc:  0.769
0.6767893
lr:  0.005  bs:  16  train_acc:  1.0  test_acc:  0.766
0.44146228
lr:  0.005  bs:  32  train_acc:  1.0  test_acc:  0.766
0.43612298
lr:  0.005  bs:  64  train_acc:  1.0  test_acc:  0.766
0.43698686
lr:  0.005  bs:  128  train_acc:  1.0  test_acc:  0.767
0.43706694
lr:  0.01  bs:  16  train_acc:  1.0  test_acc:  0.772
0.48114952
lr:  0.01  bs:  32  train_acc:  1.0  test_acc:  0.771
0.46749696
lr:  0.01  bs:  64  train_acc:  1.0  test_acc:  0.775
0.4705845
lr:  0.01  bs:  128  train_acc:  1.0  test_acc:  0.776
0.47775257
lr:  0.05  bs:  16  train_acc:  1.0  test_acc:  0.8
0.79633665
lr:  0.05  bs:  32  train_acc:  1.0  test_acc:  0.792
0.8934287
lr:  0.05  bs:  64  train_acc:  1.0  test_acc:  0.791
0.8243331
lr:  0.05  bs: 