# Best Practices

* [PYTORCH COMMON MISTAKES - How To Save Time](https://www.youtube.com/watch?v=O2wJ3tkc-TU)

1. Train for 1000 epochs on the first batch only, once with batch_size=1 and once with batch_size=32, to verify that the **train loss gets close to 0**.
2. Check that the loss of the first batch, before any training, is around ```-log(1/num_classes)```, which is the cross-entropy of a uniform prediction.
3. Double check that ```optimizer.zero_grad()``` is called in every iteration before ```loss.backward()```.
4. Double check that training runs inside a ```model.train(True)``` block and evaluation inside a ```model.eval()``` block.
5. Double check that ```model.eval()``` is called before any evaluation (validation, test). Evaluating with ```model.train(True)``` gives inaccurate results, e.g. **Dropout** is not disabled.
6. Double check whether the loss function expects **logits** or **probabilities (softmax output)**.
7. Double check whether the loss function for multi-class classification expects **sparse class indices** or **one-hot encoded labels**.
8. Monitor weight distribution mean and variance.
9. Monitor activation distribution mean and variance.
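
Checks 1 and 2 can be sketched as follows. This is a minimal sketch on synthetic data, not a definitive recipe: the small MLP, the random batch, the Adam optimizer and its learning rate are all assumptions chosen for illustration.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

num_classes = 10  # assumption: a 10-class toy problem

# Hypothetical small model, used only to illustrate the checks.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, num_classes))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# One fixed batch (batch_size=32) of random inputs and labels.
x = torch.randn(32, 20)
y = torch.randint(0, num_classes, (32,))

# Check 2: an untrained classifier predicts roughly uniformly,
# so its loss should be near -log(1/num_classes).
expected_initial_loss = -math.log(1.0 / num_classes)
with torch.no_grad():
    initial_loss = criterion(model(x), y).item()

# Check 1: repeatedly train on the same batch; the model should memorize it.
model.train()
for step in range(1000):
    optimizer.zero_grad()          # reset gradients accumulated by the previous step
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
final_loss = loss.item()

print(f"expected initial loss: {expected_initial_loss:.3f}")
print(f"actual initial loss:   {initial_loss:.3f}")
print(f"loss after overfitting one batch: {final_loss:.5f}")
```

If the final loss does not approach 0 on a single batch, the problem is more likely in the model, the loss, or the training loop than in the data volume.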

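Checks 5 to 7 can be illustrated with a few tensors. The sketch below assumes ```torch.nn.functional.cross_entropy``` as the loss, which expects raw logits; the random logits and targets are made up for illustration. Passing one-hot (probability) targets is assumed to be supported, which is the case in PyTorch 1.10 and later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Check 5: Dropout behaves differently in train and eval mode.
drop = nn.Dropout(p=0.5)
ones = torch.ones(1000)
drop.train()
train_out = drop(ones)   # some elements zeroed, the rest scaled by 1 / (1 - p)
drop.eval()
eval_out = drop(ones)    # identity: dropout is disabled

# Check 6: cross_entropy expects logits, not softmax output.
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])   # sparse class indices
loss_from_logits = F.cross_entropy(logits, targets)
# Equivalent formulation: explicit log-softmax followed by negative log-likelihood.
loss_from_log_probs = F.nll_loss(F.log_softmax(logits, dim=1), targets)
# Common mistake: softmax is applied twice, which silently gives a different loss.
loss_double_softmax = F.cross_entropy(F.softmax(logits, dim=1), targets)

# Check 7: the same labels as one-hot (float) vectors instead of class indices.
one_hot = F.one_hot(targets, num_classes=3).float()
loss_one_hot = F.cross_entropy(logits, one_hot)

print("dropout is identity in eval mode:", torch.equal(eval_out, ones))
print("loss from logits:        ", loss_from_logits.item())
print("loss from log-probs:     ", loss_from_log_probs.item())
print("loss with double softmax:", loss_double_softmax.item())
print("loss with one-hot target:", loss_one_hot.item())
```

Neither mistake raises an error, which is why these items are worth checking by hand.
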
# Bias=False before Normalization Layer

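Set ```bias=False``` in the layer directly before a **Normalization Layer**: the normalization re-centers its input to zero mean, so the bias is cancelled and becomes a redundant parameter.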

* [When should you not use the bias in a layer?](https://ai.stackexchange.com/a/27742/45763)

> The BatchNorm layer will re-center the data, removing the bias and making it a useless trainable parameter.

* [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167)

> Note that, since we normalize ```𝑊𝑢+𝑏```, the **bias ```𝑏``` can be ignored** since its effect will be canceled by the subsequent mean subtraction.

* [Why are biases (typically) not used in attention mechanism?](https://ai.stackexchange.com/a/40256/45763)

> The reason for this is that these layers are typically followed by a normalization layer, such as Batch Normalization or Layer Normalization. These normalization layers center the data at mean=0 (and std=1), effectively removing any bias.
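
The cancellation described in the quotes above can be checked numerically for Batch Normalization. The sketch below is an illustration under assumptions: the layer sizes and the random input are arbitrary, and the BatchNorm layer is kept in training mode so that it normalizes with the statistics of the current batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(8, 4)                 # arbitrary batch: 8 samples, 4 features

linear = nn.Linear(4, 3, bias=True)   # layer with a (randomly initialized) bias
bn = nn.BatchNorm1d(3)
bn.train()                            # normalize with the current batch statistics

with torch.no_grad():
    # Linear layer including its bias, followed by BatchNorm.
    out_with_bias = bn(linear(x))
    # Same weights with the bias dropped, followed by the same BatchNorm.
    out_without_bias = bn(x @ linear.weight.T)

# The bias shifts each feature by a constant, which the mean subtraction removes.
max_diff = (out_with_bias - out_without_bias).abs().max().item()
bias_is_cancelled = torch.allclose(out_with_bias, out_without_bias, atol=1e-5)

print("max abs difference:", max_diff)
print("bias is cancelled: ", bias_is_cancelled)
```

The two outputs agree up to floating-point error, so the bias adds parameters without changing the result.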

```python
import torch.nn as nn


class ConvNet(nn.Module):
  def __init__(self, channels, height, width):
    # channels, height, width: shape of one input image (C, H, W)
    super().__init__()
    self.layers = nn.Sequential(
      nn.Conv2d(channels, 16, kernel_size=3, padding="same", bias=False),   # <--- bias=False before Batch Norm
      nn.BatchNorm2d(16),
      nn.MaxPool2d(kernel_size=2),
      nn.ReLU(),
      nn.Flatten(),
      # 16 channels remain, and pooling halves each spatial dimension
      nn.Linear(16 * (height // 2) * (width // 2), 64, bias=False),         # <--- bias=False before Batch Norm
      nn.BatchNorm1d(64),
    )

  def forward(self, x):
    '''Forward pass'''
    return self.layers(x)
```