# Lec_10_Fashion MNIST
<font size=5><b>Dropout and BatchNormalization</b></font>
<div align='right'> Hoe Sung Ryu ( 류 회 성 ) </div>
    


<img src= https://leejunhyun.github.io/assets/img/TensorFlow/Fashion-MNIST-01.png>
    
---
    
Syllabus

|Event Type|Date|Topic|
|--:|:---:|:---|
|1 |July 27| Environment setting and Python basic|
|2 |July 28| PyTorch basics and custom data loading |
|3 |July 29| Traditional Machine Learning(1) |
|4 |July 30| Traditional Machine Learning(2) |
|5 |July 31| CNN(Convolutional Neural Network)(1)  |
|6 |Aug 03| CNN(Convolutional Neural Network)(2) |
|7 |Aug 04|  RNN(Recurrent Neural Networks)(1) |
|8 |Aug 05|  RNN(Recurrent Neural Networks)(2) |
|9 |Aug 06|  Transfer learning(VGG pretrained on ImageNet for CIFAR-10)| 
|10|Aug 07|**Mini_Kaggle**: Facial Expression Recognition on `AffectNet` | 
|11|Aug 08|`Awards` and `Closing`| 

---





<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Review" data-toc-modified-id="Review-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Review</a></span></li><li><span><a href="#Today" data-toc-modified-id="Today-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Today</a></span></li><li><span><a href="#BatchNormalization" data-toc-modified-id="BatchNormalization-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>BatchNormalization</a></span></li><li><span><a href="#Drop-out" data-toc-modified-id="Drop-out-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Drop out</a></span></li><li><span><a href="#Ordering-of-batch-normalization,-dropout,-Activation" data-toc-modified-id="Ordering-of-batch-normalization,-dropout,-Activation-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Ordering of batch normalization, dropout, Activation</a></span></li><li><span><a href="#Dataset-and-DataLoader" data-toc-modified-id="Dataset-and-DataLoader-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Dataset and DataLoader</a></span></li><li><span><a href="#EDA" data-toc-modified-id="EDA-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>EDA</a></span></li><li><span><a href="#Model-Build" data-toc-modified-id="Model-Build-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Model Build</a></span></li><li><span><a href="#Set-Loss-and-Optimizer" data-toc-modified-id="Set-Loss-and-Optimizer-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Set Loss and Optimizer</a></span></li><li><span><a href="#Train-and-Test" data-toc-modified-id="Train-and-Test-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Train and Test</a></span></li></ul></div>

---

## Review

<img src='../img/playground.png'>
<center>Visual illustration of connection between neural network architecture, hyperparameters, and dataset characteristics.</center> 
<br>
    
Explore this connection yourself at: https://playground.tensorflow.org/

---

## Today
- Change Batch Size
- Dropout Layer 
- BatchNorm Layer
- `Linear --> Activation --> BatchNorm --> Dropout` ordering for an MLP

reference:
- https://stackoverflow.com/questions/49433936/how-to-initialize-weights-in-pytorch

##  BatchNormalization
https://kharshit.github.io/blog/2018/12/28/why-batch-normalization


In deep neural networks, you have not only input features but also activations in the hidden layers. Can, and should, you normalize those as well? The answer is yes: normalizing the inputs to hidden layers helps the network learn faster. This is the core idea of batch normalization.
<img src=https://kharshit.github.io/img/batch_normalization.png>

During training, we normalize each layer's inputs using the mean and standard deviation (or variance) of the values in the current batch.
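
To make the per-batch normalization concrete, here is a minimal sketch that compares `nn.BatchNorm1d` in training mode with the same computation done by hand. The batch size (32), the feature width (80), and the shift and scale applied to the random input are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical batch of hidden activations: 32 samples, 80 features,
# deliberately shifted and scaled so that normalization has something to do.
x = torch.randn(32, 80) * 3.0 + 5.0

bn = nn.BatchNorm1d(80)
bn.train()          # training mode: use the statistics of the current batch
y = bn(x)

# The same normalization by hand: per-feature mean and (biased) variance
# computed over the batch dimension.
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)
y_manual = (x - mean) / torch.sqrt(var + bn.eps)

# A freshly created layer has scale 1 and shift 0, so both results agree.
print(torch.allclose(y, y_manual, atol=1e-5))
```

In evaluation mode (`bn.eval()`) the layer uses its running estimates of the mean and variance instead of the statistics of the current batch.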

## Drop out

"[Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014.](http://jmlr.org/papers/v15/srivastava14a.html)"


Large neural networks trained on relatively small datasets can overfit the training data. One way to reduce overfitting is to fit many different neural networks on the same dataset and average their predictions, an approach called an `ensemble`. The drawback of an `ensemble` is that multiple models must be fit and stored, which is a challenge when the models are large and take days to train.

`Dropout` is a regularization method that approximates training a large number of neural networks with different architectures in parallel. During training, a random subset of a layer's outputs is ignored ("dropped out"). As a result, on each update the layer looks like, and is treated like, a layer with a different number of nodes and different connectivity to the prior layer.


<img src=https://miro.medium.com/max/1200/1*iWQzxhVlvadk6VAJjsgXgg.png width=60%>
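
The following small sketch shows how `nn.Dropout` behaves differently in training and evaluation mode. The drop probability `p=0.5` and the all-ones input are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)   # assumed drop probability
x = torch.ones(1, 10)

drop.train()
y_train = drop(x)   # each element is either zeroed or scaled by 1 / (1 - p) = 2

drop.eval()
y_eval = drop(x)    # dropout is disabled: the input passes through unchanged

print(y_train)
print(y_eval)
```

PyTorch rescales the surviving activations during training, so no extra scaling is needed at inference time.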

## Ordering of batch normalization, dropout, Activation


The original `BatchNorm` paper prescribes using `BN` before `ReLU`. The following is the exact text from the paper:

> We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is “more Gaussian” (Hyvärinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution.

However, in practice I find that the opposite is true - BN after ReLU consistently performs better. 
_From [Reddit](https://www.reddit.com/r/MachineLearning/comments/67gonq/d_batch_normalization_before_or_after_relu/)_

Hence, I recommend `Linear --> Activation --> BatchNorm --> Dropout`
    
    
reference: https://stackoverflow.com/questions/39691902/ordering-of-batch-normalization-and-dropout#:~:text=For%20example%2C%20if%20the%20shift,that%20shift%20may%20be%20off.
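
As a sketch of the recommended ordering, one hidden block could be assembled as below. The helper name `mlp_block`, the dropout probability, and the layer sizes are hypothetical choices for illustration, not part of the lecture code.

```python
import torch
import torch.nn as nn


def mlp_block(n_in, n_out, p_drop=0.5):
    """Hypothetical helper: Linear -> ReLU -> BatchNorm -> Dropout."""
    return nn.Sequential(
        nn.Linear(n_in, n_out),
        nn.ReLU(),
        nn.BatchNorm1d(n_out),
        nn.Dropout(p_drop),
    )


block = mlp_block(28 * 28, 80)
x = torch.randn(16, 28 * 28)   # a fake batch of 16 flattened images
out = block(x)
print(block)
print(out.shape)
```

Reordering the layers inside `nn.Sequential` is enough to try the other orderings discussed above.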

## Dataset and DataLoader

In [None]:
import torch
from torchvision import datasets, transforms

# Shared preprocessing: convert to a tensor, then rescale pixels from [0, 1] to [-1, 1]
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])

# Load the training data (downloaded on first use if it is not already on disk)
trainset = datasets.FashionMNIST('~/.pytorch/F_MNIST_data/',
                                 download=True,
                                 train=True,
                                 transform=transform)

# Load the test data
testset = datasets.FashionMNIST('~/.pytorch/F_MNIST_data/',
                                download=True,
                                train=False,
                                transform=transform)

In [None]:
# Recommended: set batch_size to a power of two (2^n)
# TODO 
train_loader = 
test_loader = 
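
One possible way to build a loader is sketched below. To keep the example self-contained it uses a synthetic `TensorDataset` with the same shapes as Fashion-MNIST; the batch size of 64 and the name `fake_set` are assumptions. In the notebook, `trainset` and `testset` would be passed to `DataLoader` in the same way.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for `trainset`: 256 fake 1x28x28 images with labels in 0..9 (assumption)
fake_images = torch.randn(256, 1, 28, 28)
fake_labels = torch.randint(0, 10, (256,))
fake_set = TensorDataset(fake_images, fake_labels)

batch_size = 64  # a power of two; an assumed value, tune it for your hardware

# shuffle=True is the usual choice for training data; test data is normally not shuffled
loader = DataLoader(fake_set, batch_size=batch_size, shuffle=True)

images, labels = next(iter(loader))
print(images.shape, labels.shape)
```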

## EDA

In [None]:
import matplotlib.pyplot as plt
import numpy as np  # needed for np.random.randint in the next cell

In [None]:
labels_map = {0: 'T-Shirt', 1: 'Trouser', 2: 'Pullover', 3: 'Dress', 4: 'Coat',
              5: 'Sandal', 6: 'Shirt', 7: 'Sneaker', 8: 'Bag', 9: 'Ankle Boot'}
fig = plt.figure(figsize=(8, 8))
columns = 4
rows = 5
# Show a grid of randomly chosen training images with their class names
for i in range(1, columns * rows + 1):
    img_xy = np.random.randint(len(trainset))  # random sample index
    img, label = trainset[img_xy]
    fig.add_subplot(rows, columns, i)
    plt.title(labels_map[label])
    plt.axis('off')
    plt.imshow(img[0, :, :], cmap='gray')  # drop the channel dimension
plt.show()

In [None]:
# check shape 
images, labels = next(iter(train_loader))
print('data shape', images.shape)
print('label shape', labels.shape)

In [None]:
print('data shape', images.shape)

In [None]:
# flatten each image into a single vector (28*28*1 = 784 values), the input shape an MLP expects
images.view(images.shape[0], -1).shape

## Model Build

``` 
model name = BaseNet
Input, Hidden, Output == (n_in=28*28*1, n_mid=80, n_out=10)
1. Fully-connected layer
2. Activation (ReLU)
3. Fully-connected layer
4. Activation (ReLU)
5. Fully-connected layer
```



In [None]:
import torch.nn as nn
# TODO 
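
A minimal sketch of a network matching the specification above might look as follows. The attribute names (`fc1`, `fc2`, `fc3`) are hypothetical, and the sketch assumes that both hidden layers use the same width `n_mid`, since the specification gives only one hidden size.

```python
import torch
import torch.nn as nn


class BaseNet(nn.Module):
    def __init__(self, n_in=28 * 28 * 1, n_mid=80, n_out=10):
        super().__init__()
        self.fc1 = nn.Linear(n_in, n_mid)
        self.fc2 = nn.Linear(n_mid, n_mid)   # assumed: second hidden layer also has n_mid units
        self.fc3 = nn.Linear(n_mid, n_out)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = x.view(x.shape[0], -1)           # flatten (N, 1, 28, 28) -> (N, 784)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)                   # raw logits; CrossEntropyLoss applies softmax itself


model = BaseNet()
logits = model(torch.randn(4, 1, 28, 28))    # fake batch of 4 images
print(logits.shape)
```

Under these assumptions the model has 70,090 parameters, which can be checked against the parameter count printed later in the notebook.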

In [None]:
model = 
print(model)

In [None]:
total_params = sum(p.numel() for p in model.parameters())
print("Num of Total Parameter : ",total_params)
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Num of Trainable Parameter :",trainable_params)

## Set Loss and Optimizer 
```
Loss: CrossEntropyLoss
Optimizer: Adam, lr=1e-3
```

In [None]:
### Set Loss and Optimizer 
import torch.nn as nn
import torch.optim as optim
loss_fn =
optimizer = 
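
The settings above could be written as in the sketch below. The tiny `stand_in_model` exists only so that the example runs on its own; in the notebook the optimizer would receive the parameters of your own `model`.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-in model so the example is self-contained
stand_in_model = nn.Linear(28 * 28, 10)

loss_fn = nn.CrossEntropyLoss()                       # expects raw logits and integer class labels
optimizer = optim.Adam(stand_in_model.parameters(), lr=1e-3)

# Quick check on fake data: 4 flattened images and 4 class labels
logits = stand_in_model(torch.randn(4, 28 * 28))
targets = torch.tensor([0, 3, 9, 1])
loss = loss_fn(logits, targets)
print(loss.item())
```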

## Train and Test

- model.train() == (Train mode)
- model.eval() == (Evaluation mode)

In [None]:
def train(epoch):
    model.train()  # set the model to training mode

    for data, targets in train_loader:
        # 1)
        # 2)
        outputs = model(data)  # feed the data in and compute the outputs
        loss = loss_fn(outputs, targets)  # compute the error between the outputs and the ground-truth labels
        # 3) backpropagate the error
        # 4) update the weights using the backpropagated gradients

    print(f'[TRAIN] Epoch {epoch} \t Loss: {loss.item():1.5f}', end=' ')

In [None]:
def test(epoch):
    model.eval()  # switch to evaluation mode for inference; this matters especially with Dropout and BatchNorm
    correct = 0

    # 1)
        for data, targets in test_loader:
            # 2)
            outputs = model(data)
            _, predicted = torch.max(outputs.data, 1)  # index of the largest logit is the predicted class
            correct += predicted.eq(targets.data.view_as(predicted)).sum().item()  # count correct predictions

    # Print the accuracy
    data_num = len(test_loader.dataset)
    print(f'[TEST] Epoch {epoch} \t Accuracy: {correct}/{data_num} ({100.*correct/data_num :3.5f}%)')

In [None]:
Epochs = 10

for epoch in range(Epochs):
    train(epoch+1)
    test(epoch+1)
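
For reference, one common shape of a complete training and evaluation loop is sketched below. It is self-contained and runs on synthetic data, so the names `demo_model`, `demo_loader`, `train_one_epoch`, and `evaluate`, as well as the data sizes and the three epochs, are assumptions for illustration rather than the lecture's solution.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data shaped like Fashion-MNIST (not the real dataset)
fake_images = torch.randn(128, 1, 28, 28)
fake_labels = torch.randint(0, 10, (128,))
demo_loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=32)

# Small MLP using the Linear -> ReLU -> BatchNorm -> Dropout ordering
demo_model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 80), nn.ReLU(), nn.BatchNorm1d(80), nn.Dropout(0.5),
    nn.Linear(80, 10),
)
demo_loss_fn = nn.CrossEntropyLoss()
demo_optimizer = optim.Adam(demo_model.parameters(), lr=1e-3)


def train_one_epoch():
    demo_model.train()                     # enable Dropout and batch statistics
    total = 0.0
    for data, targets in demo_loader:
        demo_optimizer.zero_grad()         # clear gradients from the previous step
        loss = demo_loss_fn(demo_model(data), targets)
        loss.backward()                    # backpropagate the error
        demo_optimizer.step()              # update the weights
        total += loss.item()
    return total / len(demo_loader)        # mean loss over the batches


def evaluate():
    demo_model.eval()                      # disable Dropout, use running statistics
    correct = 0
    with torch.no_grad():                  # no gradients needed for inference
        for data, targets in demo_loader:
            predicted = demo_model(data).argmax(dim=1)
            correct += (predicted == targets).sum().item()
    return correct / len(demo_loader.dataset)


losses = [train_one_epoch() for _ in range(3)]
accuracy = evaluate()
print(losses, accuracy)
```

Because the labels here are random, the accuracy itself is not meaningful; the sketch only illustrates the order of the calls and the switch between `train()` and `eval()`.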

<div class="alert alert-success" data-title="">
  <h1><i class="fa fa-tasks" aria-hidden="true"></i> Exercise: 3 problems
  </h1>
</div>

**Compare the loss of the models below (10 epochs each):**

1. Linear - Activation - Dropout - BatchNorm - Linear - Activation - Dropout - BatchNorm - Linear --> Output

2. Linear - Activation - BatchNorm - Dropout - Linear - Activation - BatchNorm - BatchNorm - Linear --> Output

3. Linear - BatchNorm - Activation - Dropout - Linear - Activation - BatchNorm - Dropout - Linear --> Output