# Nuts and bolts of "Neural Networking"

Today we will try to fit the best NN we can for Fashion MNIST (subset to 1k examples), following the recommendations from Andrew Ng's lecture on how to train NNs in practice.

The goals are to:

* Get familiar with various basic building blocks
* Understand the "nuts and bolts" of training DNNs
    * Understand why we need a train, valid and test split
    * Understand how to reduce bias and variance
    * Understand the notion of human-level error
    * Get to know the heuristic DL workflow (overfit -> regularize -> revise priors)

When you are done, please compile a simple PDF report (e.g. copy and paste figures into a Google Doc and save it as a PDF) and put it in the Dropbox folder.

Run a range of experiments and, for each change, explain why it helps (or does not).

Refs:

* Nuts and bolts of applying Deep Learning: https://www.youtube.com/watch?v=F1ka6a13S9I , summary http://jaejunyoo.blogspot.com/2017/03/nips-2016-tutorial-summary-nuts-and-bolts-of-building-AI-AndrewNg.html
* Introduction to Convolutional networks: http://cs231n.github.io/convolutional-networks/

# Andrew Ng's "Nuts and bolts"

Nuts and bolts of applying Deep Learning: https://www.youtube.com/watch?v=F1ka6a13S9I

Some of the recommendations may sound odd at first, but apply as many of them as you can in this notebook. We will come back to them.

Note: This is very likely to be an exam question.

## The general workflow

For our purposes, let's define overfitting as reaching ~100% training accuracy (a level that is almost never achievable on the validation set).

<img width=400 src="https://3.bp.blogspot.com/-duzBNDYdDGA/WFNtNi0DcNI/AAAAAAAAPSc/AHuvDXl6EhAgweD6IxGAbqOBK5qM_W05QCLcB/s1600/nuts-and-bolts-checklist.png">

## Bias vs variance

"It takes surprisingly long time to grok bias and variance deeply, but people that understand bias and variance deeply are often able to drive very rapid progress." --Andrew Ng 

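Following the talk, compare three error rates: human-level error, training error and validation error. The gap between human-level error and training error is the bias (typical remedies: a bigger model, longer training). The gap between training error and validation error is the variance (typical remedies: more data, regularization).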

<img width=500 src="http://1.bp.blogspot.com/-IKBOtqKxf6M/WL4VFKZsI7I/AAAAAAAABYY/vOuV7QmBJSU6ca5vo3I8tzULMwtx5xInACK4B/s1600/andrewNg_8.PNG">
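
A minimal sketch of that diagnosis, assuming bias is measured as training error minus human-level error and variance as validation error minus training error. The function name `diagnose` and the error values are hypothetical, chosen only for illustration.

```python
def diagnose(human_error, train_error, valid_error):
    """Split the observed error into a bias gap and a variance gap."""
    bias = train_error - human_error      # how far training is from human level
    variance = valid_error - train_error  # how much worse validation is than training
    if bias >= variance:
        advice = "reduce bias first: bigger model, train longer"
    else:
        advice = "reduce variance first: more data, regularization"
    return bias, variance, advice


# Hypothetical error rates (fractions, not percentages)
bias, variance, advice = diagnose(human_error=0.05, train_error=0.06, valid_error=0.20)
print(f"bias={bias:.2f} variance={variance:.2f} -> {advice}")
```

With these made-up numbers the variance gap dominates, so the checklist would point toward regularization or more data rather than a bigger model.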

## Use DNNs only when you have a lot of data

Always use more data

<img width=500 src="https://github.com/gmum/nn2018/raw/master/lab/fig/7/perf.png">

## List of things we can tune

* Add/remove blocks:
    - Batch Normalization
    - Dropout
    - Convolution
    - Pooling
    - Dense
    - Activation
* Tune regularization
    - Dropout
    - L2
* Alter parameters of blocks
    - Number of units
    - Nonlinearity type
* Change optimization hyperparameters
    - Learning rate
    - Batch size

# Setup

In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import tqdm
import json

import torch
import torch.nn.functional as F

from torch import optim
from torch import nn
from torch.autograd import Variable

# Keras is used only to download Fashion MNIST
from keras.datasets import fashion_mnist

%matplotlib inline
import matplotlib.pylab as plt
import matplotlib as mpl

from torch.autograd import gradcheck

mpl.rcParams['lines.linewidth'] = 2
mpl.rcParams['figure.figsize'] = (7, 7)
mpl.rcParams['axes.titlesize'] = 12
mpl.rcParams['axes.labelsize'] = 12

# Get FashionMNIST (see 1b_FMNIST.ipynb for data exploration)
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

# Flatten each 28x28 image into a 784-dim vector (dense layers expect 2D input)
x_train = x_train.reshape(-1, 784)
x_test = x_test.reshape(-1, 784)

# 0-1 normalization
x_train = x_train / 255.
x_test = x_test / 255.

# Convert to Torch Tensor. Just to avoid boilerplate code later
x_train = torch.from_numpy(x_train).type(torch.FloatTensor)
x_test = torch.from_numpy(x_test).type(torch.FloatTensor)
y_train = torch.from_numpy(y_train).type(torch.LongTensor)
y_test = torch.from_numpy(y_test).type(torch.LongTensor)

# Keep 1k examples per split so the notebook runs faster (valid = the next 1k training examples)
x_valid, y_valid = x_train[1000:2000], y_train[1000:2000]
x_train, y_train = x_train[0:1000], y_train[0:1000]
x_test, y_test = x_test[0:1000], y_test[0:1000]

# Starting point

This section gives a basic model. Please adapt the training loop from the previous notebook yourself.

In [1]:
def build_simple_mlp(input_dim, output_dim):
    model = torch.nn.Sequential()
    model.add_module("linear_1", torch.nn.Linear(input_dim, 512, bias=False))
    model.add_module("nonlinearity_1", torch.nn.Sigmoid())
    model.add_module("linear_2", torch.nn.Linear(512, output_dim, bias=False))
    return model
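
The training loop from the previous notebook is not reproduced here, so the following is only a sketch of what such a loop might look like, not the course's reference version. It assumes plain SGD with cross-entropy loss. The helper names `accuracy` and `train`, the default hyperparameters, and the random stand-in data are all hypothetical.

```python
import torch
from torch import nn, optim


def accuracy(model, x, y):
    """Fraction of examples whose arg-max logit matches the label."""
    model.eval()
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()


def train(model, x_train, y_train, x_valid, y_valid,
          n_epochs=5, lr=0.1, batch_size=100):
    """Minibatch SGD; records train and valid accuracy after every epoch."""
    optimizer = optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    history = {"train_acc": [], "valid_acc": []}
    for _ in range(n_epochs):
        model.train()
        perm = torch.randperm(x_train.shape[0])  # reshuffle every epoch
        for start in range(0, x_train.shape[0], batch_size):
            idx = perm[start:start + batch_size]
            optimizer.zero_grad()
            loss = loss_fn(model(x_train[idx]), y_train[idx])
            loss.backward()
            optimizer.step()
        history["train_acc"].append(accuracy(model, x_train, y_train))
        history["valid_acc"].append(accuracy(model, x_valid, y_valid))
    return history


# Random stand-in data with the same shapes as the flattened images
torch.manual_seed(0)
x_tr, y_tr = torch.rand(200, 784), torch.randint(0, 10, (200,))
x_va, y_va = torch.rand(100, 784), torch.randint(0, 10, (100,))
demo_model = nn.Sequential(nn.Linear(784, 32), nn.Sigmoid(), nn.Linear(32, 10))
history = train(demo_model, x_tr, y_tr, x_va, y_va, n_epochs=2)
```

In the notebook itself, the same call could take `build_simple_mlp(784, 10)` and the `x_train`, `y_train`, `x_valid`, `y_valid` tensors from the setup cell. Tracking both accuracies per epoch is what makes the bias and variance gaps visible.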

## Training

Our goal is to try out different types of blocks and see their effect, without yet understanding each of them in depth.

In [None]:
# Simple tuning routine

# Step 1 - overfit using large MLP (bias from N&B)
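
One possible way to get a model with enough capacity to overfit 1k examples is to stack several wide hidden layers. The builder below is a sketch under that assumption. `build_large_mlp`, the hidden width of 1024, the depth of 3 and the choice of ReLU are illustrative, not prescribed by the lab.

```python
import torch
from torch import nn


def build_large_mlp(input_dim, output_dim, hidden_dim=1024, n_hidden=3):
    """MLP with n_hidden wide ReLU layers, meant to be easy to overfit with."""
    layers = []
    dim = input_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
        dim = hidden_dim
    layers.append(nn.Linear(dim, output_dim))
    return nn.Sequential(*layers)


large_mlp = build_large_mlp(784, 10)
n_params = sum(p.numel() for p in large_mlp.parameters())
logits = large_mlp(torch.rand(4, 784))  # one row of 10 logits per example
print(n_params, tuple(logits.shape))
```

If training accuracy still stalls well below 100%, the checklist suggests increasing capacity or training time before touching regularization.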

# Step 2 - regularize (variance from N&B)

## Dropout
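
A minimal sketch of adding dropout to an MLP, assuming a single hidden layer. `build_dropout_mlp` and the drop probability of 0.5 are illustrative choices. The snippet also shows that `nn.Dropout` is only active in training mode.

```python
import torch
from torch import nn


def build_dropout_mlp(input_dim, output_dim, hidden_dim=512, p_drop=0.5):
    """One hidden layer with dropout applied to its activations."""
    return nn.Sequential(
        nn.Linear(input_dim, hidden_dim),
        nn.ReLU(),
        nn.Dropout(p=p_drop),
        nn.Linear(hidden_dim, output_dim),
    )


torch.manual_seed(0)
dropout_mlp = build_dropout_mlp(784, 10)
x = torch.rand(8, 784)

dropout_mlp.train()  # dropout active: two passes use different random masks
with torch.no_grad():
    train_a, train_b = dropout_mlp(x), dropout_mlp(x)

dropout_mlp.eval()   # dropout disabled: the forward pass is deterministic
with torch.no_grad():
    eval_a, eval_b = dropout_mlp(x), dropout_mlp(x)
```

Forgetting to call `model.eval()` before measuring validation accuracy is a common source of noisy or pessimistic numbers.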

## Batch Normalization
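
A sketch of one common placement of batch normalization: after the linear layer and before the nonlinearity. The placement, the name `build_batchnorm_mlp` and the layer sizes are assumptions made for illustration. Other orderings are also used in practice.

```python
import torch
from torch import nn


def build_batchnorm_mlp(input_dim, output_dim, hidden_dim=512):
    """One hidden layer whose pre-activations are batch-normalized."""
    return nn.Sequential(
        nn.Linear(input_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, output_dim),
    )


torch.manual_seed(0)
bn_mlp = build_batchnorm_mlp(784, 10)
bn_mlp.train()  # in training mode, statistics come from the current batch
x = torch.rand(64, 784)
with torch.no_grad():
    normalized = bn_mlp[1](bn_mlp[0](x))  # output of the BatchNorm1d layer
    per_feature_mean = normalized.mean(dim=0)
    out = bn_mlp(x)
```

In training mode each hidden feature is centred across the batch, so `per_feature_mean` is numerically close to zero. In `eval()` mode the layer uses running averages instead, which is why the train/eval switch matters here as well.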

## Weight decay
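
In PyTorch, L2-style weight decay can be switched on through the `weight_decay` argument of the optimizer. The sketch below uses `optim.SGD` with illustrative values for `lr` and `weight_decay`. To isolate the effect, it takes one step with a loss whose gradient is zero, so the only change to the weights comes from the decay term.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
layer = nn.Linear(4, 3)
lr, weight_decay = 0.1, 0.01  # illustrative values, not tuned
optimizer = optim.SGD(layer.parameters(), lr=lr, weight_decay=weight_decay)

w_before = layer.weight.detach().clone()

# A loss that is identically zero: its gradient w.r.t. the weights is zero,
# so the update below is driven by weight decay alone.
optimizer.zero_grad()
loss = (layer(torch.rand(2, 4)) * 0).sum()
loss.backward()
optimizer.step()

w_after = layer.weight.detach().clone()
shrink_factor = 1 - lr * weight_decay  # each weight is multiplied by this
```

With plain SGD every step multiplies the weights by `1 - lr * weight_decay` on top of the usual gradient update, so the strength of the regularizer depends on the learning rate as well.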

## Which one (alone) was most effective?

# Step 3 - better adaptation (train-test mismatch from N&B)

## Tune CNN architecture

Compare a good CNN (tune its hyperparameters on valid) to a good MLP (tune its hyperparameters on valid).

## Tune architecture: CNN vs MLP

We will have a separate lab on convolutions. Here is a crash course on CNNs:

<img width=300 src="http://cs231n.github.io/assets/nn1/neural_net2.jpeg">

<img width=400 src="http://cs231n.github.io/assets/cnn/stride.jpeg">

CNN hyperparameters:

* Number of filters
* Filter size
* Stride (usually less important)

Ref: 
* Images from http://cs231n.github.io/convolutional-networks/
* How to create CNNs in PyTorch https://github.com/vinhkhuc/PyTorch-Mini-Tutorials/blob/master/5_convolutional_net.py
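
As a starting point for the comparison, here is a sketch of a small CNN that accepts the flattened 784-dim inputs used in this notebook. `SimpleCNN`, its single conv/pool stage and the default hyperparameters are illustrative assumptions, and the code is not taken from the linked tutorial. It assumes an odd `kernel_size`, so that the padding keeps the 28x28 spatial size.

```python
import torch
from torch import nn


class SimpleCNN(nn.Module):
    """conv -> ReLU -> 2x2 max-pool -> linear classifier."""

    def __init__(self, n_filters=16, kernel_size=3, n_classes=10):
        super().__init__()
        # padding = kernel_size // 2 keeps 28x28 for odd kernel sizes
        self.conv = nn.Conv2d(1, n_filters, kernel_size, padding=kernel_size // 2)
        self.pool = nn.MaxPool2d(2)  # 28x28 -> 14x14
        self.fc = nn.Linear(n_filters * 14 * 14, n_classes)

    def forward(self, x):
        x = x.view(-1, 1, 28, 28)  # undo the flattening done in the setup cell
        x = self.pool(torch.relu(self.conv(x)))
        return self.fc(x.flatten(start_dim=1))


cnn = SimpleCNN(n_filters=16, kernel_size=3)
cnn_logits = cnn(torch.rand(5, 784))
```

The number of filters and the filter size listed above map directly to `n_filters` and `kernel_size`, so both can be tuned on the validation split in the same way as the MLP width.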

# Extra tuning - LR regularization effect
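
One way to set up this experiment is to train the same model from the same initialization with several learning rates and compare training and validation accuracy. The sketch below only shows the mechanics. `fit_and_score`, the small model, full-batch SGD, the learning-rate grid and the random stand-in data are all assumptions, and no conclusion about the effect should be drawn from random data.

```python
import torch
from torch import nn, optim


def fit_and_score(lr, x_tr, y_tr, x_va, y_va, n_steps=50):
    """Train a small MLP with full-batch SGD; return (train_acc, valid_acc)."""
    torch.manual_seed(0)  # identical initialization for every learning rate
    model = nn.Sequential(nn.Linear(x_tr.shape[1], 64), nn.ReLU(), nn.Linear(64, 10))
    optimizer = optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(n_steps):
        optimizer.zero_grad()
        loss_fn(model(x_tr), y_tr).backward()
        optimizer.step()
    with torch.no_grad():
        train_acc = (model(x_tr).argmax(dim=1) == y_tr).float().mean().item()
        valid_acc = (model(x_va).argmax(dim=1) == y_va).float().mean().item()
    return train_acc, valid_acc


# Random stand-in data; in the notebook, pass the real train/valid tensors
gen = torch.Generator().manual_seed(1)
x_tr = torch.rand(200, 784, generator=gen)
y_tr = torch.randint(0, 10, (200,), generator=gen)
x_va = torch.rand(100, 784, generator=gen)
y_va = torch.randint(0, 10, (100,), generator=gen)

results = {lr: fit_and_score(lr, x_tr, y_tr, x_va, y_va) for lr in (0.001, 0.01, 0.1)}
for lr, (train_acc, valid_acc) in results.items():
    print(f"lr={lr}: train={train_acc:.2f} valid={valid_acc:.2f}")
```

On the real data, the gap between the two accuracies at each learning rate is the quantity to look at.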
