<a href="https://colab.research.google.com/github/lucazanottifragonara/DAV_SVM/blob/main/tut_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NN Best Practices

## 1. Learning Rates

There is a subtle distinction to be made between types of hyperparameters. 

Optimization Hyperparameters:

* **Learning Rate**
* Batch Size
* Epochs

Architecture Hyperparameters:

* Activation Functions
* Number of neurons per layer
* Number of layers

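The learning rate determines whether, and how fast, optimization converges. If it is too large, each step leaves the region where the linear approximation of the loss used by gradient-based methods is accurate, so the approximation becomes useless and the loss can diverge. If it is too small, we get reliable but slow convergence to a local optimum.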
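To make the two failure modes concrete, here is a minimal sketch on a toy problem that is not part of the tutorial's experiments: plain gradient descent on $f(w) = w^2$. The helper name `gradient_descent`, the starting point, and the step count are assumptions chosen for illustration.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimise f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # gradient step
    return w


# Too large, reasonable, and too small learning rates for this toy problem.
for lr in (1.1, 0.1, 0.001):
    print("lr={:<6} final |w| = {:.4g}".format(lr, abs(gradient_descent(lr))))
```

Each step multiplies $w$ by $(1 - 2 \cdot lr)$, so on this toy loss the iterates grow without bound once the learning rate exceeds 1, shrink quickly for 0.1, and barely move for 0.001.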

We're going to look at some scenarios where the learning rate is too large or too small, so that we can recognize the symptoms in practice and debug our networks more easily.

In [None]:
import torch

In [None]:
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        
        self.fc1 = torch.nn.Linear(100, 50)
        self.fc2 = torch.nn.Linear(50, 20)
        self.fc3 = torch.nn.Linear(20, 1)
        
        self.sigmoid = torch.nn.Sigmoid()
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.sigmoid(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        x = self.fc3(x)
        
        return x

In [None]:
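def create_dataset(n_samples, n_features):
    x = torch.randn(n_samples, n_features)
    # Note: torch.sum without `dim` reduces over all elements, so y is a single
    # scalar target that MSELoss broadcasts against every prediction.
    half_sum = torch.sum(x[:, :n_features // 2] / 10)
    y = half_sum ** 2 + half_sum
    
    return x, y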

Run the next two cells for lr values of 100, 10, 1, 0.2, and 0.001. Notice that:

* 100, 10, and 1 all blow up the loss to infinity, a divergence that cannot be recovered from.
* 0.2 results in "convergence" to a very bad loss value, likely due to sigmoid saturation.
* 0.001 allows for significant loss minimization.

In [None]:
torch.cuda.manual_seed(7)
torch.manual_seed(7)
torch.backends.cudnn.deterministic = True
x, y = create_dataset(1000, 100)

model = Model()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.0)  # Try lr of 100, 10, 1, 0.2, and 0.001. An lr of 0.2 may converge to a bad loss due to sigmoid saturation.
criterion = torch.nn.MSELoss()

In [None]:
for i in range(200):
    pred = model(x)
    loss = criterion(pred, y)
    
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    
    print("Loss: {}".format(loss))

Loss: 616232.8125
Loss: 600976.9375
Loss: 567573.5
Loss: 522713.90625
Loss: 479982.0
Loss: 440597.0
Loss: 404398.0625
Loss: 371157.0625
Loss: 340641.96875
Loss: 312632.75
Loss: 286925.09375
Loss: 263330.625
Loss: 241675.9375
Loss: 221801.75
Loss: 203561.765625
Loss: 186821.6875
Loss: 171458.171875
Loss: 157358.078125
Loss: 144417.484375
Loss: 132541.078125
Loss: 121641.328125
Loss: 111637.9375
Loss: 102457.203125
Loss: 94031.453125
Loss: 86298.6015625
Loss: 79201.6953125
Loss: 72688.4140625
Loss: 66710.7421875
Loss: 61224.66796875
Loss: 56189.75
Loss: 51568.8671875
Loss: 47328.015625
Loss: 43435.90234375
Loss: 39863.8828125
Loss: 36585.6171875
Loss: 33576.92578125
Loss: 30815.6796875
Loss: 28281.482421875
Loss: 25955.71875
Loss: 23821.208984375
Loss: 21862.22265625
Loss: 20064.345703125
Loss: 18414.326171875
Loss: 16899.990234375
Loss: 15510.1884765625
Loss: 14234.681640625
Loss: 13064.0751953125
Loss: 11989.71875
Loss: 11003.7275390625
Loss: 10098.8212890625
Loss: 9268.3271484375
...

### Rules of Thumb

1. Use Adam with an initial learning rate of 0.001.
2. Use He normal initialization for ReLU neurons, and Glorot normal initialization for tanh neurons.

Here $n$ is the number of inputs to a layer and $m$ is the number of outputs, i.e. fan-in and fan-out, and the second argument of $\mathcal{N}$ is the standard deviation.

He: $\mathcal{N}(0, \sqrt{\frac{2}{n}})$

Glorot: $\mathcal{N}(0, \sqrt{\frac{2}{n + m}})$

A bad weight initialization leads to the same problems as a poorly set learning rate, but it is much harder to debug because initialization is rarely the first thing people think to check.
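The two rules of thumb can be sketched with PyTorch's built-in initializers. The layer sizes below are assumptions chosen for illustration. The empirical standard deviations should land close to $\sqrt{2/n}$ and $\sqrt{2/(n+m)}$.

```python
import math
import torch

torch.manual_seed(0)

n, m = 512, 256                          # fan-in and fan-out (illustrative sizes)
relu_layer = torch.nn.Linear(n, m)       # layer followed by a ReLU
tanh_layer = torch.nn.Linear(n, m)       # layer followed by a tanh

# He normal: std = sqrt(2 / n)
torch.nn.init.kaiming_normal_(relu_layer.weight, mode="fan_in", nonlinearity="relu")
# Glorot normal: std = sqrt(2 / (n + m))
torch.nn.init.xavier_normal_(tanh_layer.weight)
torch.nn.init.zeros_(relu_layer.bias)
torch.nn.init.zeros_(tanh_layer.bias)

he_std = relu_layer.weight.std().item()
glorot_std = tanh_layer.weight.std().item()
print("He:     empirical {:.4f}, target {:.4f}".format(he_std, math.sqrt(2.0 / n)))
print("Glorot: empirical {:.4f}, target {:.4f}".format(glorot_std, math.sqrt(2.0 / (n + m))))

# Rule 1: Adam with an initial learning rate of 0.001.
params = list(relu_layer.parameters()) + list(tanh_layer.parameters())
optimizer = torch.optim.Adam(params, lr=0.001)
```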

## 2. Regularization 

Recall that the point of regularization is to prevent overfitting. If we use too much, though, we end up underfitting. These two concepts can be difficult to define precisely, but one possible indication of underfitting is a lack of convergence to near-perfect accuracy on the training set.

Assuming your model has high enough capacity, it should be able to fit the training set perfectly. More regularization means less effective capacity.

In [None]:
import torch
import torchvision
from torchvision import transforms

In [None]:
def load_mnist(batch_size):
    train_transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
    
    train_set = torchvision.datasets.MNIST(root="./", train=True, download=True, transform=train_transform)
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True)
    train_size = len(train_set)

    test_set = torchvision.datasets.MNIST(root="./", train=False, download=True, transform=train_transform)
    test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size, shuffle=False)
    test_size = len(test_set)
    
    data_loaders = {"train": train_loader, "test": test_loader}
    dataset_sizes = {"train": train_size, "test": test_size}
    
    height, width = test_set.data.shape[1:]  # `test_data` is deprecated in favour of `data`
    channels = 1
    classes = 10
    
    return data_loaders, dataset_sizes, height, width, channels, classes
    
    
    

In [None]:
class ClassificationModel(torch.nn.Module):
    def __init__(self, height, width, channels, n_classes):
        super(ClassificationModel, self).__init__()
        
        self.fc1 = torch.nn.Linear(height * width * channels, 200)
        self.fc2 = torch.nn.Linear(200, 100)
        self.fc3 = torch.nn.Linear(100, n_classes)
        
        self.relu = torch.nn.ReLU()
        
    def forward(self, x):
        x = x.view(x.shape[0], -1)
        
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        
        return x

In [None]:
batch_size = 100
data_loaders, dataset_sizes, height, width, channels, classes = load_mnist(batch_size)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!


Run the next two cells for weight decay values of 10, 1, and 0.1 (stop after 3 or 4 epochs for each value, since that is enough to highlight the difference). Notice that:

* 10 and 1 lead to significant underfitting (pretty much random guessing) because too much emphasis is put on making the weights small. Thus there is little or no progress on either the training or the test set.
* 0.1 gets us to about 75% train accuracy after just a single epoch. This value might still be too high if, say, the model has not cracked 90% accuracy by epoch 100, so we can't conclude anything about underfitting yet.

In [None]:
criterion = torch.nn.CrossEntropyLoss()
model = ClassificationModel(height, width, channels, classes)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.0, weight_decay=0.1) # Show weight decay of 10, 1 and 0.1. 10 underfits, 1 underfits, 0.1 is okay.
epochs = 100
use_gpu = torch.cuda.is_available()  # fall back to CPU when no GPU is present

if use_gpu:
    model = model.cuda()

In [None]:
for epoch in range(epochs):
    for phase in ["train", "test"]:
        if phase == "train":
            model.train(True)
        else:
            model.train(False)
            
        running_loss = 0.0
        running_corrects = 0
        
        for data in data_loaders[phase]:
            x, y = data
            
            if use_gpu:
                x, y = x.cuda(), y.cuda()
                
            optimizer.zero_grad()
            
            outputs = model(x)
            _, preds = torch.max(outputs, 1)
            loss = criterion(outputs, y)
            
            if phase == "train":
                loss.backward()
                optimizer.step()
                
            running_loss += loss.item() * x.size(0)
            running_corrects += torch.sum(preds == y).item()
            
        epoch_loss = running_loss / dataset_sizes[phase]
        epoch_acc = running_corrects / dataset_sizes[phase]
        
        print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

train Loss: 1.1814 Acc: 0.7672
test Loss: 0.7693 Acc: 0.8596
train Loss: 0.7197 Acc: 0.8616
test Loss: 0.6643 Acc: 0.8738
train Loss: 0.6662 Acc: 0.8713
test Loss: 0.6403 Acc: 0.8760


KeyboardInterrupt: ignored

### Rules of Thumb

1. Use no regularization to begin with. You need to make sure your model is able to fit the training data first, and then worry about generalization.
2. Always remember to normalize your inputs: $\frac{x - \mu}{\sigma}$
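
A minimal sketch of input normalization, assuming synthetic data in place of a real dataset (the tensors `train_x` and `test_x` are made up for illustration). The statistics are estimated on the training set only and then reused for the test set.

```python
import torch

torch.manual_seed(0)
train_x = 5.0 + 3.0 * torch.randn(1000, 4)   # synthetic stand-in for raw training features
test_x = 5.0 + 3.0 * torch.randn(200, 4)     # synthetic stand-in for raw test features

# Per-feature statistics, estimated on the training set only.
mu = train_x.mean(dim=0, keepdim=True)
sigma = train_x.std(dim=0, keepdim=True)

train_norm = (train_x - mu) / sigma
test_norm = (test_x - mu) / sigma            # reuse the training statistics

print(train_norm.mean(dim=0), train_norm.std(dim=0))
```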

What if we need regularization? How do we choose $\lambda$ for L1 or L2 regularization?

1. Look at the range of your loss function without regularization, i.e. observe whether there is some upper bound $k$ s.t. $\mathcal{L}_{class} < k$ after a few epochs.
2. Look at the magnitude of the regularization term $\mathcal{L}_2$. Choose $\lambda$ s.t. $\lambda \mathcal{L}_2 < k$.

Why? This makes sure that we don't care more about the regularization than we do about the actual task. Remember, this is just a starting point. If you want to squeeze out maximum performance, you need to do hyperparameter optimization for $\lambda$.
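
The two-step heuristic can be sketched as follows. The helper `l2_penalty`, the bound `k = 2.5`, and the safety factor of 0.1 are hypothetical values for illustration; in practice $k$ would be read off the unregularized training curve.

```python
import torch


def l2_penalty(model):
    """Sum of squared parameters, i.e. the unscaled L2 regularization term."""
    return sum(p.pow(2).sum() for p in model.parameters())


torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(784, 200),
    torch.nn.ReLU(),
    torch.nn.Linear(200, 10),
)

k = 2.5                                  # hypothetical upper bound on the task loss
with torch.no_grad():
    penalty = l2_penalty(model).item()   # magnitude of the L2 term at this point

lambda_max = k / penalty                 # any lambda below this keeps lambda * L2 < k
lam = 0.1 * lambda_max                   # hypothetical safety margin
print("L2 term: {:.2f}, lambda_max: {:.4f}, chosen lambda: {:.4f}".format(penalty, lambda_max, lam))
```

Note that `weight_decay` in `torch.optim.SGD` adds `weight_decay * w` to the gradient, which corresponds to a penalty of `(weight_decay / 2) * ||w||^2`, so a $\lambda$ chosen this way differs from the `weight_decay` argument by a factor of 2.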



## Regularization Techniques

So far we have talked about basics such as constraining the weights, a technique borrowed from linear regression that is not specific to neural networks.

How about something NN specific?

### BatchNorm

Batch normalization is an implicit regularization technique in that it doesn't add an extra penalty to the loss function we are trying to optimize. Thus, there is no tradeoff to be balanced, so no hyperparameter tuning is required. But what is BatchNorm?

BatchNorm was designed to address a phenomenon called "internal covariate shift": even though we normalize the inputs to the network, the activations in the hidden layers can have their distributions change, since the parameters of the network are being updated. Imagine you perform an update to the weights

$$\theta_{i+1} = \theta_i - \alpha \nabla_{\theta} \mathcal{L}$$

This gradient is for a particular mini-batch, and for a particular set of activations corresponding to that mini-batch. If you pass in the same mini-batch at the next step, the activations will have changed since the parameters changed. Thus, the gradient being computed changes, not only because we are in a different region of weight space, but also because the distribution of the activations feeding into each layer has shifted.

Without any hidden layers, this is not a problem since the input distribution remains constant.

![bn](https://cdn-images-1.medium.com/max/1600/1*Hiq-rLFGDpESpr8QNsJ1jg.png)

Note that the last line says BN is not just making the activations have zero mean and unit variance. It actually allows the network to learn the best possible scale and mean through the parameters $\gamma$ and $\beta$.
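
As a rough sketch of the BatchNorm transform at train time, the steps (batch mean, batch variance, normalize, then scale and shift) can be written by hand and compared against `torch.nn.BatchNorm1d`. The batch size and feature count are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
x = 3.0 + 2.0 * torch.randn(64, 8)    # mini-batch of 64 examples with 8 features

eps = 1e-5                            # same as the default eps of torch.nn.BatchNorm1d
gamma = torch.ones(8)                 # learnable scale, initialised to 1
beta = torch.zeros(8)                 # learnable shift, initialised to 0

mean = x.mean(dim=0)                  # per-feature mini-batch mean
var = x.var(dim=0, unbiased=False)    # per-feature (biased) mini-batch variance
x_hat = (x - mean) / torch.sqrt(var + eps)
manual_out = gamma * x_hat + beta     # scale and shift

bn = torch.nn.BatchNorm1d(8)          # modules start in training mode
reference_out = bn(x)

print(torch.allclose(manual_out, reference_out, atol=1e-5))
```

In eval mode the module normalizes with running estimates of the mean and variance instead of the mini-batch statistics, which this sketch does not reproduce.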


### Dropout

When applying dropout at train time to a given layer $i$, the activation $h^{(i)}_{j} = \sigma(z^{(i)}_j)$ becomes $h^{(i)}_{j} = \sigma(z^{(i)}_j) d^{(i)}_j$ for every neuron $h^{(i)}_j$, where $d^{(i)}_j \sim \mathrm{Bernoulli}(1 - p)$ and $p$ is the dropout rate. This means any neuron has a chance of being omitted from the model for each mini-batch processed ($d$ is sampled from scratch for every mini-batch), in which case the connections going into it and out of it have no effect.

At test time, we have to scale the activations so that their expected value is the same as at train time. Since $\mathbb{E}[d^{(i)}_j] = 1 - p$, we take $h^{(i)}_{j} = \sigma(z^{(i)}_j) (1 - p)$. Note that we are not dropping out neurons at test time.
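
A minimal sketch of these train-time and test-time rules, using random numbers as a stand-in for the activations $\sigma(z)$. With $p$ as the dropout rate, the train-time expectation is $(1 - p)\,\sigma(z)$, so the matching test-time scale factor is the keep probability $1 - p$. The tensor sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
p = 0.3                                    # dropout rate
h = torch.rand(10000, 50)                  # stand-in for the activations sigma(z)

# Train time: d ~ Bernoulli(1 - p), sampled from scratch for every mini-batch.
d = torch.bernoulli(torch.full_like(h, 1.0 - p))
h_train = h * d

# Test time: keep every neuron and scale by the keep probability (1 - p).
h_test = h * (1.0 - p)

print("train mean: {:.4f}, test mean: {:.4f}".format(h_train.mean().item(), h_test.mean().item()))

# torch.nn.Dropout implements the "inverted" variant instead: it scales the kept
# activations by 1 / (1 - p) at train time and is the identity in eval mode.
drop = torch.nn.Dropout(p=p)
drop.eval()
eval_out = drop(h)
```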

Dropout is a funny regularization technique with a couple of interpretations.

1. If $n$ is the total number of neurons in the network, the number of active neurons at any point during training is approx. $(1-p)n$, where $p \in [0,1]$ is the dropout rate. Thus, the capacity is reduced in a stochastic way.
2. Equally-weighted averaging of exponentially many models with shared weights. 

Regarding the 2nd interpretation, this "ensembling" method implies that neurons will be less correlated since the model cannot rely on the assumption that every neuron will always be present. So if a redundant linear combination of 2 neurons is what the typical SGD solution arrives at, dropout will remove that redundancy since their linear combination cannot be depended on.

**CNNs**

Standard dropout is not the right regularization technique for conv layers: neighbouring units in a feature map are spatially correlated, so dropping units independently at random doesn't really prevent information from flowing through.

### DropBlock

DropBlock is a dropout variant for CNNs that drops contiguous blocks of a feature map instead of independent units. It is best explained with an image.

![DropBlock](https://raw.githubusercontent.com/Jongchan/arxiv-screening/master/images/20181031-DropBlock-0.png)
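
The idea can be sketched in a few lines. This is a simplified illustration, not the reference implementation: the function name `drop_block` is made up, `block_size` is assumed to be odd, and the rate for block centres ignores block overlaps and border effects.

```python
import torch
import torch.nn.functional as F


def drop_block(x, drop_prob=0.1, block_size=3, training=True):
    """Simplified DropBlock for a feature map x of shape (N, C, H, W)."""
    if not training or drop_prob == 0.0:
        return x
    # Rate for block centres, chosen so that roughly drop_prob of the units are
    # dropped once every centre is grown into a block (overlaps are ignored).
    gamma = drop_prob / (block_size ** 2)
    centres = torch.bernoulli(torch.full_like(x, gamma))
    # Grow each sampled centre into a block_size x block_size square of ones.
    dropped = F.max_pool2d(centres, kernel_size=block_size, stride=1, padding=block_size // 2)
    keep_mask = 1.0 - dropped
    # Rescale so that the overall magnitude of the activations is preserved.
    scale = keep_mask.numel() / keep_mask.sum().clamp(min=1.0)
    return x * keep_mask * scale


torch.manual_seed(0)
x = torch.ones(2, 4, 16, 16)
out = drop_block(x, drop_prob=0.2, block_size=3)
print("fraction of units dropped: {:.3f}".format((out == 0).float().mean().item()))
```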

# Examples of Learning curves

In this section, we train a few simple models and observe the effect of the hyper-parameters. First, open TensorFlow Playground:

[Tensorflow Playground](https://playground.tensorflow.org/)

You should see a classification network with 2 hidden layers. We will use this network to classify two concentric circles. We start with the default parameters, i.e. learning rate 0.03, no regularization, and Tanh activation units.

* Is this problem **linearly separable**? **No**. Why? Prove it using convexity.
* Click on Run and observe the training curves on the right.
* Pause the training halfway and observe the shape of the curve.
  - Do the train/test curves overlap? Yes, so there is **no overfitting** yet.
  - Have the train/test curves flattened? Not yet, so we can probably still improve.
  - Look at the data and boundaries. Have we learned the perfect boundary? If so, where is the additional improvement in train/test loss going to come from? From an increase in prediction confidence.
* Continue the training until convergence.
  - How do you define convergence? As the point where the improvement in the loss becomes minimal.
  - Is there a gap between train/test loss? No. Why? Because the data is generated without noise and both sets come from the same distribution.
* Rerun a few times and look at the train/test curves. Sometimes you should see an almost **flat regime** midway through the train/test loss. Can you guess why that happens? Hint: look at the heat-maps of the two neurons in the final layer. It sometimes happens because the model has not yet learned to combine the two neurons into a solution: one neuron is already good, and the other has to learn to fill in the gaps it leaves. In this process, the good neuron might have to change a bit as well. Basically, we have reached a sub-optimal solution which is not a circle yet; to get there we have to combine the two neurons, and this search can involve some wandering around on a **plateau**.
  - More explanation: The flat regime starts early in training and is brief with lr=0.03, but lasts longer with lr=0.003. For the duration of this plateau, the boundary creates 3 disconnected regions: 2 red regions and one blue region that is unbounded on two sides along one direction. The drop happens when the red regions connect into one and the blue region is unbounded on only one side. https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.003&regularizationRate=0&noise=0&networkShape=4,2&seed=0.96502&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false
  
  ![Plateau](https://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/tutorials/tut6/data/train_loss.png)
* Now let's try to change a few settings.
  - Use **ReLU** activations and retrain. How does the boundary compare to the Tanh boundaries? The boundaries are more linear.
  - Use **Sigmoid** activations and retrain. Do you notice a difference in training speed compared to ReLU and Tanh? In this example Sigmoid is almost 100x slower to train than ReLU. Generally, Sigmoid and Tanh are more challenging to train when used in deep networks. We might need better initialization methods such as *Orthogonal initialization* to make them work (Pennington et al. 2017). They are nevertheless commonly used in recurrent networks, such as the LSTMs used in language modeling. Note the challenge with Sigmoid units: multiplying the input by a constant can push the inputs into the saturation regime, so initialization is really important.
  - But don't give up on Sigmoid yet. Always make sure to **fine-tune hyper-parameters** after any change to your network. What if we increase the learning rate? Would Sigmoids converge faster? **Yes**, try learning rate 0.3.
  - Let's test the model with some **noisy data**. Try increasing the noise bar on the left to 15. Train using the 3 activations (ReLU, Sigmoid, Tanh) with learning rate 0.3. Notice how the training curves for ReLU and Tanh **fluctuate** while the curve for Sigmoid doesn't. Now try learning rate 0.01. Do you still see any fluctuation?
* Let's try changing the architecture.
  - Let's expand the model. Add 4 more hidden layers with 2 units each to reach 6 hidden layers. Try training with ReLU and Sigmoid. Is it easier to train? **No**. Even though the shallow models we trained are subsets of this deeper model (set the new layers' weights to identity), the deeper models are more challenging to train. This is the justification behind using **Residual connections in ResNets**.
  - How about shrinking the model? Can you get high accuracy using only 1 hidden layer and 2 ReLU activations? How about 3? Do the same for Sigmoid and Tanh. Is there a difference in the minimum number of units needed?
* Let's try adding regularization.
  - Try the spiral data. See how the model can **overfit**. Try adding regularization. For this data, use a 4-layer network, each layer with 6 ReLU units. To see overfitting, add noise and decrease the ratio of training examples to test examples.
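
The convexity argument asked for in the first question above can be sketched as follows, assuming (as in the circle dataset) that every inner-class point lies in the convex hull of the outer-class points $x_1, \dots, x_N$. Suppose a linear classifier $f(x) = w^\top x + b$ separated the classes, with $f(x_i) < 0$ for every outer point. Any inner point $z$ can be written as $z = \sum_i \lambda_i x_i$ with $\lambda_i \ge 0$ and $\sum_i \lambda_i = 1$, so

$$f(z) = w^\top \Big(\sum_i \lambda_i x_i\Big) + b = \sum_i \lambda_i \big(w^\top x_i + b\big) = \sum_i \lambda_i f(x_i) < 0$$

which places $z$ on the outer-class side of the boundary, a contradiction. Hence no linear boundary separates the two circles.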

# Training curves from papers

## Google neural machine translation (Wu et al. 2016)
GNMT is a translation system by Google used in http://translate.google.com. This model is an example where Sigmoid and Tanh activations are used. Here is the training curve for GNMT showing the loss vs steps.

![GNMT loss vs steps](https://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/tutorials/tut6/data/gnmt_lp.png)

A few interesting points in this plot:
* Adam converges faster at the beginning, with fewer fluctuations than SGD. In general, it is common to see Adam outperform SGD, at least early in training, for language modeling. Since GNMT was published, AdamW (Loshchilov 2017) has been proposed, which can help with both the late-stage convergence of Adam and its generalization.
* To close the gap between Adam and SGD, GNMT switches from Adam to SGD during training. This switch happens around $0.6\times10^5$ steps. It is interesting to see that the loss temporarily goes up after the switch. In general, one has to be patient when training neural networks and not judge a change based on only a few iterations.
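
A minimal sketch of switching optimizers partway through training, on a toy regression problem rather than a translation model. The switch point, step count, and learning rates are hypothetical and are not the values used by GNMT.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(20, 1)
x, y = torch.randn(256, 20), torch.randn(256, 1)   # toy regression data
criterion = torch.nn.MSELoss()

switch_step, total_steps = 60, 100                 # hypothetical schedule
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
losses = []

for step in range(total_steps):
    if step == switch_step:
        # Same parameters, fresh optimizer: Adam's moment estimates are discarded.
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print("loss before switch: {:.4f}, final loss: {:.4f}".format(losses[switch_step - 1], losses[-1]))
```

`torch.optim.AdamW` accepts the same `params` and `lr` arguments and could be swapped in for `Adam` in the first phase.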

## Deep Residual Networks (He 2016)

ResNets are one of the latest image classification models performing favorably on the ImageNet classification task. The network uses a hierarchy of convolution layers and ReLUs, plus a few other key ideas that help with optimization, i.e. batch normalization, residual connections, and a special initialization for ReLU networks.

![Resnet training curves](https://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/tutorials/tut6/data/resnet.png)

Here thin curves are the training error and bold curves are the test error.
Interesting points:
* There are 3 learning rate reductions, each by a factor of $10$. After each reduction we see a drop in the error. The important point is that a large learning rate makes fast progress at first but quickly plateaus; we then need to drop the learning rate so that the updates make smaller, less disruptive changes to the model.
* For the larger model, ResNet-34, we see a gap between training and test performance at the end. A gap alone is not overfitting: we only speak of overfitting when the validation loss goes up, not when it stays constant.
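
A step schedule of this kind can be sketched with `torch.optim.lr_scheduler.MultiStepLR`. The model, initial learning rate, and milestone epochs below are assumptions for illustration, not the settings used in the paper.

```python
import torch

model = torch.nn.Linear(10, 1)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Divide the learning rate by 10 at each (hypothetical) milestone epoch.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

lrs = []
for epoch in range(100):
    lrs.append(optimizer.param_groups[0]["lr"])     # learning rate used in this epoch
    # ... one epoch of training would go here ...
    optimizer.step()                                # no gradients here, so parameters stay unchanged
    scheduler.step()                                # advance the schedule once per epoch

print(lrs[0], lrs[30], lrs[60], lrs[90])
```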

# References

1. Pennington, Jeffrey, Samuel Schoenholz, and Surya Ganguli. "Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice." Advances in neural information processing systems. 2017.
2. Wu, Yonghui, et al. "Google's neural machine translation system: Bridging the gap between human and machine translation." arXiv preprint arXiv:1609.08144 (2016).
3. Loshchilov, Ilya, and Frank Hutter. "Fixing weight decay regularization in adam." arXiv preprint arXiv:1711.05101 (2017).
4. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.